Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines

Latest revision as of 22:40, 1 February 2021

Abstract

Parallel training of a Deep Neural Network (DNN) ensemble on a cluster of nodes is a common practice, used either to train multiple models and combine them into one with higher prediction accuracy, or to quickly tune the parameters of a model being trained. Existing ensemble training pipelines perform a great deal of redundant operations, resulting in unnecessary CPU usage or even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks can provide. This project investigates a series of designs that improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod, and test them using several large DNNs on a large-scale GPU cluster, the Titan supercomputer at Oak Ridge National Laboratory. Our results show that with the new flexible communication schemes, the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation can achieve up to 10X speedups when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16% compared to the baseline.

Original document

The different versions of the original document can be found in:

https://repository.lib.ncsu.edu/bitstream/1840.20/35263/1/etd.pdf

http://xplorestaging.ieee.org/ielx7/8657676/8665721/08665793.pdf?arnumber=8665793

http://dx.doi.org/10.1109/sc.2018.00067

https://dblp.uni-trier.de/db/conf/sc/sc2018.html#PittmanGSLP18

https://dl.acm.org/citation.cfm?id=3291742

https://repository.lib.ncsu.edu/handle/1840.20/35263

https://academic.microsoft.com/#/detail/2801648232
