==Summary==
Replacing the traditional forward and backward passes in a residual network with a Multigrid-Reduction-in-Time (MGRIT) algorithm paves the way for exploiting parallelism across the layer dimension. In this paper, we evaluate the layer-parallel MGRIT algorithm with respect to convergence, scalability, and performance on regression problems. Specifically, we demonstrate that a few MGRIT iterations solve the systems of equations corresponding to the forward and backward passes in ResNets up to reasonable tolerances. We also demonstrate that the MGRIT algorithm breaks the scalability barrier created by the sequential propagation of data during the forward and backward passes. Moreover, we show that ResNet training using the layer-parallel algorithm significantly reduces the training time compared to the layer-serial algorithm on two non-linear regression tasks, and we observe markedly more efficient training loss curves with layer-parallel ResNets than with layer-serial ResNets on these tasks. We hypothesize that the error introduced by approximately solving the forward- and backward-pass systems with MGRIT helps the optimizer escape flat, saddle-point-like plateaus or local minima on the optimization landscape. We validate this by showing that artificially injecting noise into a standard forward or backward pass allows the optimizer to escape a saddle-point-like plateau at network initialization.
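The following NumPy sketch is not the authors' implementation; it only illustrates the layer-serial forward recurrence that MGRIT replaces and the noise-injection experiment mentioned above. The layer count, layer width, tanh residual block, step size, and noise scale are illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal sketch (illustrative assumptions, not the paper's code): the
# layer-serial ResNet forward pass u_{l+1} = u_l + h * F(u_l; W_l), with an
# optional noise term that mimics the perturbation introduced when the
# forward-pass system is only solved approximately (e.g. by a few MGRIT
# iterations).
import numpy as np

rng = np.random.default_rng(0)

def resnet_forward(u0, weights, h=0.1, noise_scale=0.0):
    """Propagate u0 through the residual layers sequentially.

    noise_scale > 0 adds a small random perturbation to each state,
    standing in for the error of an approximate (layer-parallel) solve.
    """
    u = u0
    states = [u]
    for W in weights:
        u = u + h * np.tanh(W @ u)
        if noise_scale > 0.0:
            u = u + noise_scale * rng.standard_normal(u.shape)
        states.append(u)
    return states

# Toy setup: 16 residual layers of width 4 and a single input vector.
layers, width = 16, 4
weights = [0.1 * rng.standard_normal((width, width)) for _ in range(layers)]
u0 = rng.standard_normal(width)

exact = resnet_forward(u0, weights)                    # layer-serial reference
noisy = resnet_forward(u0, weights, noise_scale=1e-3)  # perturbed propagation

# The perturbed trajectory stays close to the exact one; this is the regime
# in which the paper hypothesizes the extra error can help the optimizer
# move off flat plateaus rather than degrade training.
print(np.linalg.norm(exact[-1] - noisy[-1]))
</syntaxhighlight>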
+ | |||
+ | == Abstract == | ||
+ | <pdf>Media:Draft_Sanchez Pinedo_569355817455_abstract.pdf</pdf> | ||
+ | |||
+ | == Full Paper == | ||
+ | <pdf>Media:Draft_Sanchez Pinedo_569355817455_paper.pdf</pdf> |
Published on 24/11/22
Accepted on 24/11/22
Submitted on 24/11/22
Volume Science Computing, 2022
DOI: 10.23967/eccomas.2022.239
Licence: CC BY-NC-SA