
A Study of Gradient Variance in Deep Learning

1 Introduction

Many machine learning tasks entail the minimization of the risk, , where is an i.i.d. sample from a data distribution, and is the per-example loss parametrized by . In supervised learning, inputs and ground-truth labels comprise , and is a vector of model parameters. Empirical risk approximates the population risk by the risk of a sample set, the training set, as . Empirical risk is often minimized using gradient-based optimization (first-order methods). For differentiable loss functions, the gradient of is defined as , i.e. the derivative of the loss with respect to the parameters evaluated at a point . Popular in deep learning, Mini-batch Stochastic Gradient Descent (mini-batch SGD) iteratively takes small steps in the opposite direction of the average gradient of training samples. The mini-batch size is a hyperparameter that provides flexibility in trading per-step computation time for potentially fewer total steps. In GD the mini-batch is the entire training set, while in SGD it is a single sample.
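
As a concrete reference for the update just described, here is a minimal mini-batch SGD sketch in plain NumPy; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def minibatch_sgd(grad_fn, theta, data, batch_size=32, lr=0.1, steps=1000, seed=0):
    """Iteratively step opposite the average mini-batch gradient.

    grad_fn(theta, batch) returns the average per-example gradient over `batch`.
    batch_size=len(data) recovers full-batch GD; batch_size=1 recovers plain SGD.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # uniform sampling without replacement
        theta = theta - lr * grad_fn(theta, data[idx])        # step against the gradient estimate
    return theta
```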

In general, using any unbiased stochastic estimate of the gradient and sufficiently small step sizes, SGD is guaranteed to converge to a minimum for various function classes (Robbins and Monro, 1951). Common convergence bounds in stochastic optimization improve with smaller gradient variance (Bottou et al., 2018). Mini-batch SGD is said to converge faster because the variance of the gradient estimates is reduced at a rate linear in the mini-batch size. In practice, however, we observe diminishing returns in speeding up the training of nearly any deep model on deep learning benchmarks (Shallue et al., 2018). One explanation not studied by previous work is that the variance numerically reaches zero. The transition point to diminishing returns is known to depend on the choice of data, model, and optimization method. Zhang et al. (2019) observed that the limitation on acceleration in large batches is reduced when momentum or preconditioning is used. Other works suggest that very small mini-batch sizes can still converge fast enough using a collection of tricks (Golmant et al., 2018; Masters and Luschi, 2018; Lin et al., 2020). One hypothesis is that the stochasticity due to small mini-batches improves generalization by finding "flat minima" and avoiding "sharp minima" (Goodfellow and Vinyals, 2015; Keskar et al., 2017). But this hypothesis does not explain why diminishing returns also appear in the training loss.

Motivated by the diminishing returns phenomenon, we study and model the distribution of the gradients. In the noisy gradient view, the average mini-batch gradient (or the mini-batch gradient) is treated as an unbiased estimator of the expected gradient, where increasing the mini-batch size reduces the variance of this estimator. We propose a distributional view and argue that knowledge of the gradient distribution can be exploited to analyze and improve optimization speed as well as generalization to test data. A mean-aware optimization method is at best as strong as a distribution-aware optimization method. In our distributional view, the mini-batch gradient is only an estimate of the mean of the gradient distribution.

Questions: We identify the following understudied questions about the gradient distribution.

  • Structure of the gradient distribution. Is there structure in the distribution over gradients of standard learning problems?

  • Impact of the gradient distribution on optimization. What characteristics of the gradient distribution correlate with the convergence speed and the minimum training/test loss reached?

  • Impact of optimization on the gradient distribution. To what extent do the following factors affect the gradient distribution: data distribution, learning rate, model architecture, mini-batch size, optimization method, and the distance to local optima?

(a) GD step 0
(b) GD step 1
(c) GD step 2
Figure 1: Example of clusters found using Gradient Clustering.

A linear classifier visualized during training with gradient descent on two linearly separable classes (o, x). Gradients are assigned to clusters (different colors) using Gradient Clustering (GC). The black line depicts the current decision boundary. Colored dashed lines depict the decision boundaries predicted from the current boundary and each of the individual clusters. Here, blue points belong to both classes; they have similar gradients, but are far apart in input space. By exploiting the knowledge of GC we can obtain low-variance average mini-batch gradients.

Contributions:

  • Exploiting clustered distributions. We consider gradient distributions with distinct modes, i.e. the gradients can be clustered. We show that the variance of the average mini-batch gradient is minimized if the elements are sampled from a weighted clustering in gradient space (Section 3).

  • Efficient clustering to minimize variance. We propose Gradient Clustering (GC) as a computationally efficient method for clustering in the gradient space (Section 3.2). Fig. 1 shows an example of clusters found by GC.

  • Relation between gradient variance and optimization.

    We study the gradient variance on common deep learning benchmarks (MNIST, CIFAR-10, and ImageNet) as well as Random Features models recently studied in deep learning theory (Section 4). We observe that gradient variance increases during training, and smaller learning rates coincide with higher variance.

  • An alternative statistic. We introduce normalized gradient variance as a statistic that better correlates with the speed of convergence compared to gradient variance (Section 4).

We emphasize that some of our contributions are primarily empirical yet unexpected. We encourage the reader to predict the behaviour of gradient variance before reaching our experiments section. We believe our results provide an opportunity for future theoretical and empirical work.

2 Related Work

Modeling the gradient distribution. Despite various assumptions on the mini-batch gradient variance, there are limited studies of this statistic during the training of deep learning models. It is common to assume bounded variance in convergence analyses (Bottou et al., 2018). Works on variance reduction suggest alternative estimates of the gradient mean with low variance (Le Roux et al., 2012; Johnson and Zhang, 2013), but they do not plot the variance, which is the actual quantity they seek to reduce. Their ineffectiveness in deep learning has been observed but still requires explanation (Defazio and Bottou, 2019). There are a few works that present gradient variance plots (Mohamed et al., 2019; Wen et al., 2019), but they are usually for a single gradient coordinate and synthetic problems. The central limit theorem is also used to argue that the distribution of the mini-batch gradient is Gaussian (Zhu et al., 2019), which has been challenged only recently (Simsekli et al., 2019). There also exists a link between the Fisher information matrix (Amari, 1998), the Neural Tangent Kernel (Jacot et al., 2018), and the gradient covariance matrix (Martens, 2014; Kunstner et al., 2019; Thomas et al., 2020). As such, any analysis of one (e.g. Karakida et al., 2019) could potentially be used to understand the others.

The variance is rarely used for improving optimization. Le Roux et al. (2011) considered the difference between the covariance matrix of the gradients and the Fisher matrix and proposed incorporating the covariance matrix as a measure of model uncertainty in optimization. It has also been suggested that the division by the second moments of the gradient in Adam can be interpreted as variance adaptation (Kunstner et al., 2019). There are myriad papers on ad-hoc sampling and re-weighting methods for reducing dataset imbalance and increasing data diversity (Bengio and Senecal, 2008; Jiang et al., 2017; Vodrahalli et al., 2018; Jiang et al., 2019). Although we do not use Gradient Clustering for optimization, the formulation can be interpreted as a unifying approach that defines variance reduction as an objective.

Clustering gradients. Methods related to gradient clustering have been proposed for low-variance gradient estimation (Hofmann et al., 2015; Zhao and Zhang, 2014; Chen et al., 2019), supported by promising theory. However, these methods have either limited their experiments to linear models or treated a deep model as a linear one. Our proposed GC method performs efficient clustering in the gradient space with very few assumptions. GC is also related to works on model visualization where the entire training set is used to understand the behaviour of a model (Raghu et al., 2017).

3 Mini-batch Gradient with Stratified Sampling

An important factor affecting optimization trade-offs is the diversity of the training data. SGD entails a sampling process, often sampling uniformly from the training set. However, as illustrated in the following example, uniform sampling is not always ideal. Suppose there are duplicate data points in a training set. We can save computation time by removing all but one of the duplicates. To get the same gradient mean in expectation, it is sufficient to rescale the gradient of the remaining sample in proportion to the number of duplicates. In this case, mini-batch SGD will be inefficient because duplicates increase the variance of the gradient mean estimate.

Suppose we are given i.i.d. training data, , and a partition of their gradients, , into subsets, where is the size of the -th subset. We can estimate the gradient mean on the training set, , by averaging gradients, one from each of the subsets, sampled uniformly:

where are the gradients from subset , is the gradient of the -th sample in the -th subset, and is the index of the cluster to which the -th data point is assigned, so . (While the former indices refer to data within the partition, the latter indexes the training points independent of the partitioning.) Each sample is treated as a representative of its subset and weighted by the size of that subset. In the limit of , we recover the batch gradient mean used in GD, and for we recover the single-sample stochastic gradient of SGD.
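
As an illustration of Eq. 1, a sketch of the stratified estimator, assuming per-example gradients are available as rows of an array (our own naming; GC in Section 3.2 avoids ever materializing these gradients):

```python
import numpy as np

def stratified_gradient_mean(per_example_grads, clusters, rng):
    """Estimate the gradient mean by drawing one gradient per cluster,
    weighting each draw by the size of its cluster (Eq. 1).

    per_example_grads: (N, d) array of per-example gradients.
    clusters: list of index arrays partitioning range(N).
    """
    n = per_example_grads.shape[0]
    estimate = np.zeros(per_example_grads.shape[1])
    for members in clusters:
        i = rng.choice(members)                          # uniform draw within the cluster
        estimate += (len(members) / n) * per_example_grads[i]
    return estimate
```

With every point in its own subset this recovers the full-batch mean; with a single subset it recovers the single-sample stochastic gradient, matching the two limits above.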

Proposition 3.1 (Bias/Variance of Mini-batch Gradient with Stratified Sampling). For any partition of the data, the estimator of the gradient mean using stratified sampling (Eq. 1) is unbiased ( ) and , where is defined as the trace of the covariance matrix. (Proof in Section A.1)
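
For reference, the standard stratified-sampling identities behind the proposition, written in our own notation (one uniform draw i_k from each subset S_k of size N_k, independent across subsets; c_k denotes the within-subset mean gradient), are:

```latex
\mathbb{E}[\hat{g}]
  = \frac{1}{N}\sum_{k=1}^{K} N_k\, \mathbb{E}[g_{i_k}]
  = \frac{1}{N}\sum_{k=1}^{K} \sum_{i \in S_k} g_i
  = \bar{g},
\qquad
\operatorname{tr}\operatorname{Cov}(\hat{g})
  = \frac{1}{N^2}\sum_{k=1}^{K} N_k^2\, \operatorname{tr}\operatorname{Cov}(g_{i_k})
  = \frac{1}{N^2}\sum_{k=1}^{K} N_k \sum_{i \in S_k} \lVert g_i - c_k \rVert^2 .
```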

Remark. In a dataset with duplicate samples, the gradients of duplicates do not contribute to the variance if assigned to the same partition with no other data points.

3.1 Weighted Gradient Clustering

Suppose, for a given number of clusters, , we want to find the optimal partition, i.e., one that minimizes the variance of the gradient mean estimator, . For -dimensional gradient vectors, minimizing the variance in Proposition 3.1 is equivalent to finding a weighted clustering of the gradients of the data points,

where a cluster center, , is the average of the gradients in the -th cluster, and . If we did not have the factor , this objective would be equivalent to the K-Means objective. The additional factors encourage larger clusters to have lower variance, with smaller clusters comprising scattered data points.
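
Putting the pieces together, our reconstruction of the clustering objective implied by Proposition 3.1 (the exact stripped equation may differ in constants and notation) is:

```latex
\min_{\{S_k\}_{k=1}^{K}} \;\; \frac{1}{N^2} \sum_{k=1}^{K} N_k \sum_{i \in S_k} \big\lVert g_i - c_k \big\rVert^2,
\qquad
c_k = \frac{1}{N_k} \sum_{i \in S_k} g_i .
```

Dropping the N_k weight in front of each cluster's sum recovers the K-Means objective.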

If we could store the gradients for the entire training set, the clustering could be performed iteratively as a form of block coordinate descent, alternating between the following Assignment and Update steps, i.e., computing the cluster assignments and then the cluster centers:

(3)

(4)

The step is still too complex given the multiplier. As such, we first solve it for fixed cluster sizes and then update before another step. These updates are similar to Lloyd's algorithm for K-Means, but with the multipliers, and to Expectation-Maximization for Gaussian Mixture Models, but here we use hard assignments. In contrast, the additional multiplier makes the objective more complex in that performing updates does not always guarantee a decrease in the clustering objective.
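
A sketch of this full-gradient block coordinate descent, assuming all per-example gradients fit in memory; this is the expensive baseline that the efficient GC of Section 3.2 approximates (variable names are ours):

```python
import numpy as np

def weighted_gradient_clustering(grads, n_clusters, n_iters=10, seed=0):
    """Alternate assignment (A) and center-update (M) steps for the
    size-weighted objective sum_k N_k * sum_{i in k} ||g_i - c_k||^2."""
    rng = np.random.default_rng(seed)
    n, d = grads.shape
    centers = grads[rng.choice(n, size=n_clusters, replace=False)]  # init centers from data
    sizes = np.full(n_clusters, n / n_clusters)                     # held fixed during assignment
    for _ in range(n_iters):
        # A-step: assign each gradient to the cluster with the smallest size-weighted cost.
        dists = ((grads[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, K)
        assign = np.argmin(sizes[None, :] * dists, axis=1)
        # M-step: recompute centers as within-cluster means, then refresh the sizes.
        for k in range(n_clusters):
            members = np.flatnonzero(assign == k)
            if members.size > 0:
                centers[k] = grads[members].mean(axis=0)
        sizes = np.bincount(assign, minlength=n_clusters).astype(float)
        sizes = np.maximum(sizes, 1.0)  # avoid zero weights for empty clusters
    return assign, centers, sizes
```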

3.2 Efficient Gradient Clustering (GC)

Performing exact updates (Eqs. 3 and 4) is computationally expensive, as they require the gradient of every data point. Deep learning libraries usually provide efficient methods that compute average mini-batch gradients without ever calculating full individual gradients. We introduce Gradient Clustering (GC) for performing efficient updates by breaking them into per-layer operations and introducing a low-rank approximation to the cluster centers.

Algorithm 1: step using Eq. 5

Algorithm 2: update

Algorithm 3: step using Eq. 6
Figure 2: Steps in Gradient Clustering.

For any feed-forward network, we can decompose the terms in the updates into independent per-layer operations as shown in Fig. 2. The primary operations are computing and cluster updates per layer ; henceforth, we drop the layer index for simplicity.

For a single fully-connected layer, we denote the layer weights by , where and denote the input and output dimensions of the layer. We denote the gradient with respect to for the training set by , where comprises the input activations to the layer, and represents the gradients with respect to the layer outputs. The coordinates of the cluster centers corresponding to this layer are denoted by . We index the clusters using and the data by . The -th cluster center is approximated as , using vectors and .

In the step we need to compute as part of the assignment cost, where is the Frobenius norm. We expand this term into three inner products and compute them separately. In particular, the term can be written as,

where denotes the inner product, and the RHS is the product of two scalars. Similarly, we compute the other two terms in the expansion of the assignment cost, i.e. and (Goodfellow (2015) proposed a similar idea to compute the gradient norm).
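
A sketch of this A-step cost for a fully-connected layer, assuming per-example inputs x and output gradients delta are available (e.g. via hooks); the per-example gradient outer products are never materialized:

```python
import numpy as np

def assignment_costs(x, delta, u, v, cluster_sizes):
    """Size-weighted assignment costs N_k * ||g_i - c_k||_F^2 for one layer.

    x:     (n, d_in)  per-example inputs to the layer
    delta: (n, d_out) per-example gradients w.r.t. the layer outputs
    u, v:  (K, d_in), (K, d_out) rank-1 factors of the K cluster centers c_k
    The per-example gradient is the outer product of x_i and delta_i, so
      <g_i, c_k> = (x_i . u_k) * (delta_i . v_k),
      ||g_i||^2  = ||x_i||^2 * ||delta_i||^2,   ||c_k||^2 = ||u_k||^2 * ||v_k||^2.
    """
    cross = (x @ u.T) * (delta @ v.T)                    # (n, K): <g_i, c_k>
    g_sq = (x ** 2).sum(1) * (delta ** 2).sum(1)         # (n,)
    c_sq = (u ** 2).sum(1) * (v ** 2).sum(1)             # (K,)
    costs = g_sq[:, None] - 2.0 * cross + c_sq[None, :]  # squared Frobenius distances
    return costs * cluster_sizes[None, :]                # weight by cluster size N_k
```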

The step in Eq. 4 is written as . This equation might have no exact solution for and because the sum of rank- matrices is not necessarily rank- . One approximation is the min-Frobenius-norm solution to

using truncated SVD, where we use the left and right singular vectors corresponding to the largest singular value of the RHS. However, the following updates are exact if the activations and the gradients of the outputs are uncorrelated, i.e.

(similar to assumptions in K-FAC (Martens and Grosse, 2015)),

In Section B.1, we describe similar update rules for convolutional layers, and in Section B.2 we provide a complexity analysis of GC. We can make the cost of GC negligible by making sparse incremental updates to the cluster centers using mini-batch updates. The assignment step can also be made more efficient by processing only a portion of the data, as is common for training on large datasets. The rank- approximation can be extended to higher-rank approximations with multiple independent cluster centers, though with challenges in the implementation.
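
Before moving on, a sketch of the rank-1 M-step described above: the exact new center is the within-cluster mean of per-example outer products, and a min-Frobenius-norm rank-1 fit is obtained from its leading singular pair (dense per-example factors are used here purely for illustration):

```python
import numpy as np

def update_center_rank1(x_members, delta_members):
    """Rank-1 approximation u v^T of a cluster's mean gradient,
    where each member gradient is the outer product x_i delta_i^T."""
    # Exact cluster mean: (1/m) * sum_i x_i delta_i^T, a (d_in x d_out) matrix.
    mean_grad = x_members.T @ delta_members / x_members.shape[0]
    # Truncated SVD: keep the leading singular pair (min-Frobenius-norm rank-1 fit).
    U, s, Vt = np.linalg.svd(mean_grad, full_matrices=False)
    u = U[:, 0] * s[0]
    v = Vt[0]
    return u, v  # center approximated as outer(u, v)
```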

4 Experiments

In this section, we evaluate the accuracy of estimators of the gradient mean. This is a surrogate task for evaluating the performance of a model of the gradient distribution. We compare our proposed GC estimator to the average mini-batch Stochastic Gradient (SG-B), and SG-B with double the mini-batch size (SG-2B). SG-2B is an important baseline for two reasons. First, it is a competitive baseline that always reduces the variance by a factor of and requires at most twice the memory and twice the run-time per mini-batch (Shallue et al., 2018). Second, the extra overhead of GC is approximately the same as keeping an extra mini-batch in memory when the number of clusters is equal to the mini-batch size. We also include Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013) as a method with the sole objective of estimating the gradient mean with low variance.

We compare methods on a single trajectory of mini-batch SGD to decouple the optimization from the gradient estimation. That is, we do not train with any of the estimators (hence no 'D' in SG-B and SG-2B). This allows us to keep analyzing a method even after it fails to reduce the variance. For training results using SG-B, SG-2B, and SVRG, we refer the reader to Shallue et al. (2018); Defazio and Bottou (2019). For training with GC, it suffices to say that the behaviours observed in this section are directly related to the performance of GC used for optimization.

As all estimators in this work are unbiased, the estimator with the lowest variance better estimates the gradient mean. We define Average Variance (variance for short) as the average over all coordinates of the variance of the gradient mean estimate for a fixed model snapshot. The average variance is the normalized trace of the covariance matrix and is of particular interest in random matrix theory (Tao, 2012). We also measure Normalized Variance, defined as , where the variance of a -dimensional random variable is divided by its second non-central moment. In signal processing, the inverse of this quantity is the signal-to-noise ratio (SNR). If the SNR is less than one (normalized variance larger than one), the power of the noise is greater than that of the signal. Additional details of the experimental setup can be found in Appendix C.
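
For concreteness, both statistics can be computed from repeated gradient-mean estimates at a fixed snapshot as follows (a sketch with our own naming; averaging over coordinates in both numerator and denominator leaves the ratio unchanged):

```python
import numpy as np

def variance_statistics(grad_estimates):
    """grad_estimates: (S, d) array of S independent estimates of the gradient mean
    at one model snapshot (e.g. from S sampled mini-batches)."""
    per_coord_var = grad_estimates.var(axis=0, ddof=1)   # variance of each coordinate
    avg_variance = per_coord_var.mean()                  # normalized trace of the covariance
    second_moment = (grad_estimates ** 2).mean()         # second non-central moment (avg over coords)
    normalized_variance = avg_variance / second_moment   # inverse of a signal-to-noise ratio
    return avg_variance, normalized_variance
```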

4.1 MNIST: Low Variance, CIFAR-10: Noisy Estimates, ImageNet: No Structure

(a) MLP on MNIST
(b) ResNet8 on CIFAR-10
(c) ResNet18 on ImageNet
(d) MLP on MNIST
(e) ResNet8 on CIFAR-10
(f) ResNet18 on ImageNet
Figure 3: Image classification models. Variance (top) and normalized variance (bottom). We observe: normalized variance correlates with optimization difficulty, variance is decreasing on MNIST but increasing on CIFAR-10 and ImageNet, and variance fluctuates with GC on CIFAR-10.

In this section, we study the evolution of gradient variance during training of an MLP on MNIST (LeCun et al., 1998), ResNet8 (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2009), and ResNet18 on ImageNet (Deng et al., 2009). Curves shown are from a single run and statistics are smoothed over a rolling window. The standard deviation within the window is shown as a shaded area.

Normalized variance correlates with the time required to improve accuracy. In Figs. 3(c), 3(b), and 3(a), the variance of SG-2B is always half the variance of SG-B. A drawback of the variance is that it is not comparable across different problems. For example, on CIFAR-10 the variance of all methods reaches while on ImageNet, where normally more iterations are needed, the variance is below . In Figs. 3(f), 3(e), and 3(d), normalized variance better correlates with the convergence speed. Normalized variance on both MNIST and CIFAR-10 is always below , while on ImageNet it quickly goes above (noise stronger than gradient). Notice that the denominator in the normalized variance is shared between all methods on the same trajectory of mini-batch SGD. As such, the normalized variance retains the relation of the curves and is a scaled version of the variance where the scaling varies during training as the norm of the gradient changes. For clarity, we only show the curve for SG-B.

How does the difficulty of optimization change during training? The variance on MNIST for all methods is constantly decreasing (Fig. 3(a)), i.e. the strength of the noise decreases as we get closer to a local optimum. These plots suggest that training an MLP on MNIST satisfies the Strong Growth Condition (SGC) (Schmidt and Le Roux, 2013), as the variance is numerically zero (below ). Normalized variance (Fig. 3(d)) decreases over time and is well below (the gradient mean has larger magnitude than the variance). SVRG performs particularly well by the end of training because the training loss has converged to near zero (cross-entropy less than ). Promising published results with SVRG are usually on datasets similar to MNIST where the loss reaches relatively small values. In contrast, on both CIFAR-10 (Figs. 3(e) and 3(b)) and ImageNet (Figs. 3(f) and 3(c)), the variance and normalized variance of all methods increase during training, particularly after the learning rate drops. This means gradient variance depends on the distance to local optima. We hypothesize that the gradient of each training point becomes more unique as training progresses.

Variance can change widely during training, but this happens only on particularly noisy data. On CIFAR-10, the variance of GC suddenly goes up but comes back down before any updates to the cluster centers (Fig. 3(b)), while the variance of SVRG monotonically increases between updates. To explain these behaviours, observe that immediately after cluster updates, GC and SVRG should always have at most the same average variance as SG-B. We observed this behaviour consistently across different architectures such as other variations of ResNet and VGG on CIFAR-10. Fig. 6 shows the effect of adding noise on CIFAR-10. Label smoothing (Szegedy et al., 2016) reduces the fluctuations but not completely. On the other hand, label corruption, where we randomly change the labels for a portion of the training data, eliminates the fluctuations. We hypothesize that the model is oscillating between different states with significantly different gradient distributions. The experiments with corrupt labels suggest that mislabeled data might be the cause of the fluctuations, such that having more randomness in the labels forces the model to ignore originally mislabeled data.

Is the gradient distribution clustered in any dataset? The variance of GC on MNIST (Fig. 3(a)) is consistently lower than SG-2B, which means it is exploiting clustering in the gradient space. On CIFAR-10 (Fig. 3(b)) the variance of GC is lower than SG-B but not lower than SG-2B, except when fluctuating. The improved variance is more noticeable when training with corrupt labels. On ImageNet (Figs. 3(f) and 3(c)), the variance of GC overlaps with SG-B. An example of a gradient distribution where GC overlaps with SG-B is a uniform distribution. The gradient distribution can still be structured; for example, there could exist clusters in subspaces of the gradient space.

4.2 Random Features Models: How Does Overparametrization Affect the Variance?

The Random Features (RF) model (Rahimi and Recht, 2007) provides an effective way to explore the behaviour of optimization methods across a family of learning problems. The RF model facilitates the discovery of optimization behaviours including the double-descent shape of the risk curve (Hastie et al., 2019; Mei and Montanari, 2019). We train a student RF model with hidden dimensions on a fixed training set, , , sampled from a teacher model, , where is the ReLU activation function, and the teacher hidden features , and second layer weights and bias, , are sampled from the standard normal distribution. Each -dimensional random feature of the teacher is scaled to norm 1. We train a student RF model with random features and second layer weights by minimizing the cross-entropy loss. In Fig. 4, we train hundreds of Random Features models and plot the average variance and normalized variance of the gradient estimators. We show both the maximum and the mean of the statistics during training. The maximum better captures fluctuations of a gradient estimator and allows us to link our observations of variance to generalization using standard convergence bounds that rely on bounded noise (Bottou et al., 2018).
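
A sketch of this teacher–student setup with our own variable names; sizes are left as arguments, and generating binary labels from the sign of the teacher output is an assumption, since the exact label model is not recoverable from the text:

```python
import numpy as np

def make_rf_data(n_train, d_input, teacher_hidden, seed=0):
    """Generate a fixed training set from a random-features teacher:
    y = sign(a . relu(F x) + b), with each teacher feature scaled to unit norm."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((teacher_hidden, d_input))
    F /= np.linalg.norm(F, axis=1, keepdims=True)      # each d-dimensional feature has norm 1
    a = rng.standard_normal(teacher_hidden)            # teacher second-layer weights
    b = rng.standard_normal()                          # teacher bias
    X = rng.standard_normal((n_train, d_input))
    y = np.sign(np.maximum(X @ F.T, 0.0) @ a + b)      # teacher labels (assumed binary)
    return X, y

def student_features(X, student_hidden, seed=1):
    """Random (fixed) student features; only the second-layer weights are trained
    by minimizing the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((student_hidden, X.shape[1])) / np.sqrt(X.shape[1])
    return np.maximum(X @ W.T, 0.0)
```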

(a) SGD trajectories
(b) Trajectory of LR=
(c) Trajectory of LR=
Figure 4: Random Features models. Variance (log-scale) versus the over-parametrization coefficient (student's hidden size divided by the training set size). We observe: the teacher's hidden size is not influential, and variance is low in the overparametrized regime and with larger learning rates. We aggregate results from hyperparameters not shown.
(a) SGD max norm. var. (b) SGD mean norm. var.
Figure 5: Normalized variance on overparametrized RF is less than .
(a) Label smoothing (b) Corrupt labels
Figure 6: CIFAR-10. Fluctuations disappear with corrupt labels.
(a) CIFAR-10 (b) CIFAR-100
Figure 7: Image classification with duplicates exploited by GC.

Do models with a small generalization gap converge faster? Based on the small error bars, the only hyperparameters that affect the variance are the learning rate and the ratio of the size of the student hidden layer to the training set size. In contrast, in analyses of risk and the double descent phenomenon, we usually find a dependence on the ratio of the student hidden layer size to the teacher hidden layer size (Mei and Montanari, 2019). This suggests that models that generalize better are not necessarily ones that train faster.

Does "diminishing returns" happen because of overparametrization? Figs. 3(c) and3(b) show that with the same learning rate, all methods attain similar variance in the overparametrized regime. Annotation that due to the normalization of random features, the gradients in each coordinate are expected to decrease as overparametrization increases. We conjecture that the diminishing returns in increasing the mini-batch size should also be observed in overparametrized random features models like to linear and deeper models(Zhang et al., 2019; Shallue et al., 2018).

Why does the loss usually drop immediately after a learning rate drop? Fig. 4(a) shows that the variance is smaller for trajectories with larger learning rates and that the gap grows as overparametrization grows. This is a direct consequence of the dependence of the noise in the gradient on the current parameters. In Section 4.1 we observe the opposite of this behaviour in deep models. In contrast, Fig. 5 shows that for overparametrization less than , all trajectories have similar normalized variance that is larger than one (the noise is more powerful than the gradient). As such, we hypothesize that the reduction in variance is not the sole reason for the immediate decrease in the loss after a learning rate drop (e.g. He et al., 2016).

How does SGD avoid local minima? In Fig. 5(a), the error bars on the maximum normalized variance are long for overparametrization coefficients less than . The reason is that in some iterations the second moment of the gradient for some coordinates gets close to zero while the noise due to mini-batching is still non-zero. Often in the next iteration, the gradient becomes large again. This is an example where SGD avoids local minima due to noise.

Why does SVRG fail in deep learning? Figs. 4(b) and 4(c) show that the gain of SVRG vanishes in the over-parametrized regime ( ), where all methods have relatively low variance (below ). We hypothesize that the cause is the generally lower variance of the noise in the overparametrized regime rather than the staleness of the control variate.

4.3 Duplicates: Back to the Motivation for Gradient Clustering

(a) duplicates
(b) duplicates
(c) duplicates
Figure 8: Training RF models with duplicates. GC identifies and exploits duplicates. Plots are similar to Fig. 4. The learning rate in all three is . In each training run, there are data points that are repeated so as to make up (left), (middle), and (right) of the training set.

In Fig. 8, we trained random features models with additional duplicated data points. We observe that as the ratio of duplicates to non-duplicates increases, the gap between the variance of GC and the other methods improves. Without duplicate data, GC is always between SG-B and SG-2B. It is almost never worse than SG-B and never better than SG-2B. GC is as good as SG-2B at mild overparametrization ( ). We need a degree of overparametrization for GC to reduce the variance, but too much overparametrization leaves no room for improvement. When duplicates exist, GC performs well with a gap that does not decrease with overparametrization.

Similarly, experiments on CIFAR-10 and CIFAR-100 (Fig. 7) show that GC significantly reduces the variance when duplicate data points exist. Note that because of common data augmentations, duplicate data points are not exactly duplicated in the input space, and there is no guarantee that their gradients would be similar.

5 Conclusion

In this work we introduced tools for understanding optimization behaviour and explaining previously perplexing observations in deep learning. We expect our contributions to not only guide improvements to optimization speed and generalization but also the design of interpretable models. We have provided evidence that structured gradient distributions such as clustered gradients exist and that statistics of gradients can provide insight into optimization performance. Nevertheless, exploiting this knowledge to improve optimization has proven to be challenging.

The authors would like to thank Nicolas Le Roux, Fabian Pedregosa, Mark Schmidt, Roger Grosse, Sara Sabour, and Aryan Arbabi for helpful discussions and feedback on this manuscript. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

References

  • Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • Bengio and Senecal (2008) Yoshua Bengio and Jean-Sébastien Senecal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4):713–722, 2008.
  • Bottou et al. (2018) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
  • Chen et al. (2019) Beidi Chen, Yingchen Xu, and Anshumali Shrivastava. Fast and accurate stochastic gradient estimation. In Neural Information Processing Systems (NeurIPS), pages 12339–12349, 2019.
  • Defazio and Bottou (2019) Aaron Defazio and Léon Bottou. On the ineffectiveness of variance reduced optimization for deep learning. In Neural Information Processing Systems (NeurIPS), pages 1753–1763, 2019.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE Computer Society, 2009.
  • Golmant et al. (2018) Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, and Joseph Gonzalez. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent. arXiv e-prints, art. arXiv:1811.12941, November 2018.
  • Goodfellow (2015) Ian Goodfellow. Efficient Per-Example Gradient Computations. arXiv e-prints, art. arXiv:1510.01799, October 2015.
  • Goodfellow and Vinyals (2015) Ian J. Goodfellow and Oriol Vinyals. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations (ICLR), 2015.
  • Hastie et al. (2019) Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv e-prints, art. arXiv:1903.08560, March 2019.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE Computer Society, 2016.
  • Hofmann et al. (2015) Thomas Hofmann, Aurélien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Neural Information Processing Systems (NeurIPS), pages 2305–2313, 2015.
  • Jacot et al. (2018) Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Neural Information Processing Systems (NeurIPS), pages 8580–8589, 2018.
  • Jiang et al. (2019) Angela H Jiang, Daniel L-K Wong, Giulio Zhou, David G Andersen, Jeffrey Dean, Gregory R Ganger, Gauri Joshi, Michael Kaminksy, Michael Kozuch, Zachary C Lipton, et al. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.
  • Jiang et al. (2017) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. arXiv e-prints, art. arXiv:1712.05055, December 2017.
  • Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems (NeurIPS), pages 315–323, 2013.
  • Karakida et al. (2019) Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Pathological spectra of the fisher information metric and its variants in deep neural networks. arXiv preprint arXiv:1910.05992, 2019.
  • Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • Kunstner et al. (2019) Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In Neural Information Processing Systems (NeurIPS), pages 4158–4169, 2019.
  • Le Roux et al. (2011) Nicolas Le Roux, Yoshua Bengio, and Andrew Fitzgibbon. Improving first and second-order methods by modeling uncertainty. Optimization for Machine Learning, page 403, 2011.
  • Le Roux et al. (2012) Nicolas Le Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2012.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lin et al. (2020) Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don't use large mini-batches, use local SGD. In International Conference on Learning Representations (ICLR). OpenReview.net, 2020.
  • Martens (2014) James Martens. New insights and perspectives on the natural gradient method. arXiv e-prints, art. arXiv:1412.1193, December 2014.
  • Martens and Grosse (2015) James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), volume 37 of JMLR Workshop and Conference Proceedings, pages 2408–2417. JMLR.org, 2015.
  • Masters and Luschi (2018) Dominic Masters and Carlo Luschi. Revisiting Small Batch Training for Deep Neural Networks. arXiv e-prints, art. arXiv:1804.07612, April 2018.
  • Mei and Montanari (2019) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv e-prints, art. arXiv:1908.05355, August 2019.
  • Mohamed et al. (2019) Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte carlo gradient estimation in machine learning. arXiv preprint arXiv:1906.10652, 2019.
  • Raghu et al. (2017) Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Neural Information Processing Systems (NeurIPS), pages 6076–6085, 2017.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184. Curran Associates, Inc., 2007.
  • Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Schmidt and Le Roux (2013) Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
  • Shallue et al. (2018) Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the Effects of Data Parallelism on Neural Network Training. arXiv e-prints, art. arXiv:1811.03600, November 2018.
  • Simsekli et al. (2019) Umut Simsekli, Levent Sagun, and Mert Gürbüzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 5827–5837. PMLR, 2019.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE Computer Society, 2016.
  • Tao (2012) Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
  • Thomas et al. (2020) Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, and Nicolas Le Roux. On the interplay between noise and curvature and its effect on optimization and generalization. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  • Vodrahalli et al. (2018) Kailas Vodrahalli, Ke Li, and Jitendra Malik. Are All Training Examples Created Equal? An Empirical Study. arXiv e-prints, art. arXiv:1811.12569, November 2018.
  • Wen et al. (2019) Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Zhang et al. (2019) Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger B. Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Neural Information Processing Systems (NeurIPS), pages 8194–8205, 2019.
  • Zhao and Zhang (2014) Peilin Zhao and Tong Zhang. Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling. arXiv e-prints, art. arXiv:1405.3080, May 2014.
  • Zhu et al. (2019) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 7654–7663. PMLR, 2019.

Appendix A Additional Details of Gradient Clustering (Section 3)

A.1 Proof of Proposition 3.1

The gradient estimator, , is unbiased for any partition of the data, i.e. equal to the average gradient of the training set,

where we use the fact that the expectation of a random sample drawn uniformly from a subset is equal to the expectation of the average of the samples from that subset. Also note that the gradient of every training example appears once in .

Although partitioning does not affect the bias of , it does affect the variance,

where the variance is defined as the trace of the covariance matrix. Since we assume the training set is sampled i.i.d., the covariance between the gradients of any two samples is zero. In a dataset with duplicate samples, the gradients of duplicates will be clustered into one cluster with zero variance if mixed with no other data points.

Appendix B Additional Details of Efficient GC (Section 3.2)

B.1 Convolutional Layers

In neural networks, the convolution operation is performed as an inner product between a set of weights , namely kernels, and patches of size in the input. Assuming that we have preprocessed the input by extracting patches, the gradient w.r.t. is , where is the gradient at spatial location and is the flattened dimension of a patch. The gradient at spatial location is computed as .

As in the fully-connected case, we use a rank- approximation to the cluster centers in a convolutional layer, defining . As such, steps are performed efficiently. For the step we rewrite ,

where the input dimension is indexed by and the output dimension is indexed by . Eqs. 8 and 9 provide two ways of computing the inner product, where we first compute the inner sums and then the outer sum. The efficiency of each formulation depends on the size of the kernel and the layer's input and output dimensions.
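
A sketch of one of the two summation orders for the convolutional A-step term, written with our own names for the extracted patches and output gradients; the alternative order and its cost trade-off are noted in the comments:

```python
import numpy as np

def conv_center_inner_products(patches, delta, u, v):
    """<g_i, c_k> for a convolutional layer, where g_i = sum_t patch_{i,t} delta_{i,t}^T
    and c_k is the rank-1 center u_k v_k^T.

    patches: (n, T, p)     extracted input patches (T spatial locations, p = flattened patch dim)
    delta:   (n, T, c_out) gradients w.r.t. the layer outputs at each location
    u, v:    (K, p), (K, c_out) rank-1 factors of the K cluster centers

    This follows the 'project first, then sum over locations' ordering:
        <g_i, c_k> = sum_t (patch_{i,t} . u_k) * (delta_{i,t} . v_k).
    The alternative ordering forms sum_t patch_{i,t} delta_{i,t}^T per example first;
    which one is cheaper depends on T, p, c_out and the number of clusters K.
    """
    pu = np.einsum('ntp,kp->ntk', patches, u)  # (n, T, K): patch projections onto u_k
    dv = np.einsum('ntc,kc->ntk', delta, v)    # (n, T, K): output-gradient projections onto v_k
    return (pu * dv).sum(axis=1)               # (n, K): sum over spatial locations
```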

B.2 Complexity Analysis

Operation  FC Complexity  Conv Complexity
Back-prop
step    See Sec. B.2
step
Table 1: Complexity of GC compared to the cost of back-prop.

GC, described in Fig. 2, performs two sets of operations, namely the cluster center updates ( step) and the assignment of data to clusters ( step). steps immediately affect the optimization by changing the sampling process. As such, we perform an step every few epochs and change the sampling right afterwards. In contrast, the step can be done in parallel and more frequently than the step, or online using mini-batch updates. The cost of both steps is amortized over the optimization steps.

Table 1 summarizes the run-time complexity of GC compared to the cost of a single SGD step. The step is always cheaper than a single back-prop step. The step is cheaper for fully-connected layers if .

For convolutional layers, we have two ways to compute the terms in the step (Eqs. 8 and 9). For , if , Eq. 9 is more efficient. For , Eq. 9 is more efficient if . If , both methods have lower complexity than a single back-prop step. If we did not have the multiplier in the step, we could ignore the computation of the norm of the gradients, and hence further reduce the cost.

In common neural network architectures, the condition is easily satisfied, as in all layers is almost always more than and usually greater than , while - clusters provide significant variance reduction. As such, the total overhead with an efficient implementation is at most about the cost of a normal back-prop step. We can further reduce this cost by performing GC on a subset of the layers, e.g. one might exclude the lowest convolutional layers.

The total memory overhead is equivalent to increasing the mini-batch size by samples, as we only need to store rank- approximations to the cluster centers.

Appendix C Additional Details for Experiments (Section 4)

The mini-batch size in GC and SVRG and the number of clusters in GC are the same as the mini-batch size in SG-B, and the same as the mini-batch size used for training with SGD. To measure the gradient variance, we take snapshots of the model during training, sample tens of mini-batches from the training set (in the case of GC, with stratified sampling), and measure the average variance of the gradients.

We measure the performance metrics (e.g. loss, accuracy, and variance) as functions of the number of training iterations rather than wall-clock time. In other words, we do not consider the computational overhead of the different methods. In practice, such analysis is valid as long as the additional operations can be parallelized with negligible cost.

C.1 Experimental Details for Image Classification Models (Section 4.1)

On MNIST, our MLP model consists of three fully connected layers: layer1: , layer2: , layer3: . We use ReLU activations and no dropout in this MLP. We train all methods with learning rate , weight decay , and momentum . On CIFAR-10, we train ResNet8 with no batch normalization layers and learning rate , weight decay , and momentum for iterations. We decay the learning rate at and iterations by a factor of . On CIFAR-100, we train ResNet32 starting with learning rate . Other hyperparameters are the same as for CIFAR-10. On ImageNet, we train ResNet18 starting with learning rate , weight decay , and momentum . We use a learning rate schedule similar to CIFAR-10.

Dataset Model T Log T Estim T U GC T
MNIST MLP
CIFAR-10 ResNet8
CIFAR-100 ResNet32
ImageNet ResNet18
Table 2: Hyperparameters.

In Table 2 we list the following hyperparameters: the interval of measuring gradient variance and normalized variance (Log T), the number of gradient estimates used in measuring the variance (Estim T), the interval of updating the control variate in SVRG and the clustering in GC (U), and the number of GC update iterations (GC T).

In plots for random features models, each point is generated by keeping fixed at and varying in the range . We average over random seeds, teacher hidden dimensions, and input dimensions (both and student hidden). We use mini-batch size for SG-B, SVRG, and GC.

A rough estimate of the overparametrization coefficient (discussed in Section 4.2) for deep models is to divide the total number of parameters by the number of training data points. On MNIST the coefficient is approximately for the CNN and for the MLP. On CIFAR-10 it is approximately for ResNet8 and for ResNet32. Common data augmentations increase the effective training set size by . On the other hand, depth potentially increases the capacity of models exponentially (cite the paper that theoretically says how many data points a model can memorize). As such, it is difficult to directly relate these numbers to the behaviours observed in RF models.

C.2 Experimental Details for Random Features Models (Section 4.2)

The number of training iterations is chosen such that the training loss has flattened. The maximum is taken over the last of the iterations (the variance is usually high for all methods in the first ). Mean variance plots for random features models are similar to the max variance plots presented in Section 4.2.

Figure 9: Mean variance plots for Fig. 4.

We aggregate results from multiple experiments with the following range of hyperparameters. Each point is generated by keeping fixed at and varying in the range . We average over random seeds, teacher hidden dimensions, and input dimensions (both and student hidden).


Source: https://deepai.org/publication/a-study-of-gradient-variance-in-deep-learning