Replacing the traditional forward and backward passes in a residual network with a Multigrid-Reduction-in-Time (MGRIT) algorithm paves the way for exploiting parallelism across the layer dimension. In this paper, we evaluate the layer-parallel MGRIT algorithm with respect to convergence, scalability, and performance on regression problems. Specifically, we demonstrate that a few MGRIT iterations solve the systems of equations corresponding to the forward and backward passes in ResNets up to reasonable tolerances. We also demonstrate that the MGRIT algorithm breaks the scalability barrier created by the sequential propagation of data during the forward and backward passes. Moreover, we show that ResNet training using the layer-parallel algorithm significantly reduces the training time compared to the layer-serial algorithm on two non-linear regression tasks. We observe much more efficient training loss curves using layer-parallel ResNets as compared to the layer-serial ResNets on two regression tasks. We hypothesize that the error stemming from approximately solving the forward and backward pass systems using the MGRIT algorithm helps the optimization algorithm escape flat saddle-point-like plateaus or local minima on the optimization landscape. We validate this by illustrating that artificially injecting noise in a typical forward or backward propagation, allows the optimizer to escape a saddle-point-like plateau at network initialization.
Are you one of the authors of this document?