Easy Last updated on May 7, 2022, 1:16 a.m.

As we know, neural networks are optimized with variants of gradient descent such as Adam, RMSProp, and Adagrad. Training a neural network architecture means optimizing a large number of heavily interdependent parameters, and learning them consumes a considerable amount of data before the network reaches a local minimum. On top of that, optimization is made harder by the stochastic nature of mini-batch gradient descent, whose gradient estimates are noisy, and by the practice of modifying the learning rate throughout training.

We can also think about it in terms of the flow of gradients. Each update moves the parameters only a small step, so gradient descent extracts only a little information from each sample it sees, and it takes many updates to learn from the data. This can be compensated for by using a limited number of samples and making multiple passes over them. These passes, known as epochs, allow the algorithm to converge without requiring an impractical amount of data.
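The effect of multiple passes can be illustrated with a minimal sketch. It is not a neural network: it assumes a toy linear model `y = w*x + b` trained with plain mini-batch SGD, and the function name `train`, the 16-point dataset, the learning rate, and the batch size are all hypothetical choices made for illustration.

```python
import random


def train(xs, ys, epochs, lr=0.05, batch_size=4, seed=0):
    """Fit y = w*x + b with mini-batch SGD; return (w, b, per-epoch losses)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    losses = []
    for _ in range(epochs):
        # Reshuffle so every epoch sees different mini-batches (the stochastic part).
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the mean squared error over this mini-batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Each update is only a small step, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
        # Record the loss on the full dataset at the end of the epoch.
        losses.append(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, b, losses


# Hypothetical toy data: 16 noiseless points from y = 2x + 1 on [-1, 1].
xs = [-1 + 2 * i / 15 for i in range(16)]
ys = [2 * x + 1 for x in xs]

_, _, one_pass = train(xs, ys, epochs=1)
w, b, many_passes = train(xs, ys, epochs=200)

print(f"loss after 1 epoch:    {one_pass[-1]:.4f}")
print(f"loss after 200 epochs: {many_passes[-1]:.6f} (w={w:.3f}, b={b:.3f})")
```

With the same 16 samples, a single pass leaves the model far from the target line, while repeated passes let the small updates accumulate until `w` and `b` approach 2 and 1. A real network has far more parameters and a non-convex loss, so the sketch only shows the mechanism, not realistic data requirements.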

Overall, we can say that the combination of gradient-based optimization and heavily interdependent parameters is what makes neural networks so data-hungry.

Frequently asked by: Amazon, Microsoft