Measuring the Effects of Data Parallelism on Neural Network Training
Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured in the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
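The central measurement the abstract describes, counting the training steps needed to first reach a fixed goal out-of-sample error and repeating that measurement across batch sizes with the learning rate re-tuned at each one, can be sketched in a few lines. The snippet below is a minimal illustration only, not the authors' experimental pipeline: the toy logistic-regression workload, the synthetic data, the goal error, the learning-rate grid, and the batch sizes are all assumptions chosen to keep the example self-contained and runnable.

```python
# Minimal sketch (not the authors' code): for each batch size, count the SGD
# steps needed to first reach a fixed goal out-of-sample error, re-tuning the
# learning rate separately for every batch size.
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(50_000, d))
y_train = (X_train @ w_true + 0.5 * rng.normal(size=50_000) > 0).astype(float)
X_val = rng.normal(size=(5_000, d))
y_val = (X_val @ w_true > 0).astype(float)

def val_error(w):
    # out-of-sample 0/1 error of the linear classifier defined by w
    return np.mean((X_val @ w > 0) != y_val)

def steps_to_result(batch_size, lr, goal=0.05, max_steps=10_000):
    # plain mini-batch SGD on the logistic loss; return the first step at which
    # the validation error drops to the goal, or max_steps if it never does
    w = np.zeros(d)
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, len(X_train), size=batch_size)
        z = np.clip(X_train[idx] @ w, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))                      # logistic predictions
        w -= lr * X_train[idx].T @ (p - y_train[idx]) / batch_size
        if step % 25 == 0 and val_error(w) <= goal:
            return step
    return max_steps

for bs in [16, 64, 256, 1024, 4096]:
    # crude per-batch-size learning-rate tuning, mirroring the paper's emphasis
    # on re-tuning the hyperparameters at every batch size
    best = min(steps_to_result(bs, lr) for lr in [0.03, 0.1, 0.3, 1.0, 3.0])
    print(f"batch size {bs:5d}: ~{best} steps to reach the goal validation error")
```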
Authors

Christopher J. Shallue
Jaehoon Lee
Joe Antognini
Jascha Sohl-Dickstein
Roy Frostig
George E. Dahl
Tweets
bbaeksoondae: https://t.co/K9XwtU3wL2 Large batch works to some extent. Few models train efficiently at TPU Pod scale. Resnet50 has become new Dhrystone (https://t.co/T6hFffSuhb).
statisticscom: Measuring the Effects of Data Parallelism in Training #NeuralNetworks: https://t.co/XfWDiPSqxy #GoogleBrain #BigData #DataScience #AI #MachineLearning #DeepLearning #Algorithms #HPC #DataEngineering https://t.co/A0iHG7flbt
KirkDBorne: Measuring the Effects of Data Parallelism in Training #NeuralNetworks: https://t.co/TmCrZISqIb #GoogleBrain #BigData #DataScience #AI #MachineLearning #DeepLearning #Algorithms #HPC #DataEngineering https://t.co/cUwq4WPY8k
fabtar: Google Brain just published a paper on data parallelism and batch size in neural networks: "Measuring the Effects of Data Parallelism on Neural Network Training". https://t.co/a1w85ekzac
ComputerPapers: Measuring the Effects of Data Parallelism on Neural Network Training. https://t.co/qyHXfwiOIh
dcpage3: @PiotrCzapla Thanks for the feedback! Lots of interesting results to digest on the same subject in the new paper from Google AI https://t.co/VUbqq81r1u
reddit_ml: [R] Measuring the Effects of Data Parallelism on Neural Network Training (Google Brain) https://t.co/2waC44xzWx
dcpage3: Lots of evidence + a few puzzles for our theory of distinct NN training regimes dominated by catastrophic forgetting/curvature in the huge new study from @GoogleAI https://t.co/VUbqq81r1u. Background here: https://t.co/x8w3tgL07e
mosko_mule: Measuring the Effects of Data Parallelism on Neural Network Training https://t.co/2s94aB8bs8 A thorough experimental study of the relationships between CNNs, LSTMs, Transformers, batch size, learning rate, the number of training steps required, etc. (the power of a big compute budget). Among the findings: batch size is proportional to the number of steps, and to the effective learning rate, only while the batch size is small.
nmfeeds: [O] https://t.co/mTxFjHer4w Measuring the Effects of Data Parallelism on Neural Network Training. Recent hardware developm...
NirantK: If you are into making training and inference times faster (e.g. for edge devices, or mobiles) - consider taking a look at this paper as well https://t.co/dOGhsgPWcT They explore mostly batchsize and the effects of such hyperparams
hardmaru: Measuring the Effects of Data Parallelism on Neural Network Training: Large empirical study investigating the tradeoffs between batch sizes, optimal learning rates, and training steps required. https://t.co/2pdnCdAy6N https://t.co/vxlsJN37kS
vedadian: Friends who are particular about the choice of batch size when training deep networks -- and that concern is well placed -- would do well to take a look at this paper too. https://t.co/bIr4ipzXWi
SalehCU: @mraginsky @SebastienBubeck @prfsanjeevarora Interesting work by colleagues at Google looking into SGD, SGD with momentum as well as Nesterov momentum as optimizers in a data-parallelism training framework: https://t.co/cLmNpNNzUY @roydanroy
hereticreader: [1811.03600] Measuring the Effects of Data Parallelism on Neural Network Training - https://t.co/9D1COPFWPY https://t.co/Klyntsfu7d
jaschasd: Everything you wanted to know about the role of batch size in neural net training, but didn't have the computational resources to ask! With Chris Shallue, Jaehoon Lee, Joe @joe_antognini, Roy Frostig, and George Dahl. https://t.co/TWKtsaJiRu
SingularMattrix: Measuring the effects of data parallelism on neural network training. A great example of careful science in machine learning. https://t.co/dVRm6R5Ejy
arxivml: "Measuring the Effects of Data Parallelism on Neural Network Training", Christopher J. Shallue, Jaehoon Lee, Joe An… https://t.co/DBuXZy923M
zacharynado: Awesome large scale, large batch paper! Turns out while you can scale the batch size pretty high, first order optimizers don't converge quicker after a certain batch size (across datasets and model sizes!) https://t.co/L5GFfvTMCq https://t.co/G1jtqD9fY4
BrundageBot: Measuring the Effects of Data Parallelism on Neural Network Training. Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl https://t.co/0iFuOjH7S6