Introduction by Mohammad:
For this reading group we’ll give an overview of some recently proposed techniques for optimizing neural network parameters. SGD, Momentum, NAG, and AdaGrad have already been covered in the Stanford CNN class.
In general what we want to know is:
- Why do we need these optimization methods?
- What are each of them trying to solve?
- How can we connect them together?
- How can we know which one is useful for our models?
- Do we have a winner among all these methods?
- With adaptive methods, hyperparameter tuning is less critical;
- Well-tuned stochastic gradient descent is hard to beat by a significant margin.
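To ground the discussion, here is a minimal sketch of three of the update rules mentioned above (SGD, Momentum, AdaGrad), run on a toy quadratic objective. The hyperparameter values are illustrative, not recommendations:

```python
import numpy as np

def sgd(w, grad, lr=0.1):
    # Plain SGD: step against the gradient.
    return w - lr * grad

def momentum(w, v, grad, lr=0.1, mu=0.9):
    # Momentum: accumulate a velocity, then step along it.
    v = mu * v - lr * grad
    return w + v, v

def adagrad(w, cache, grad, lr=0.1, eps=1e-8):
    # AdaGrad: per-parameter step size scaled down by the
    # running sum of squared gradients.
    cache = cache + grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps), cache

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0.
w_sgd = 1.0
w_mom, v = 1.0, 0.0
w_ada, cache = 1.0, 0.0
for _ in range(100):
    w_sgd = sgd(w_sgd, 2 * w_sgd)
    w_mom, v = momentum(w_mom, v, 2 * w_mom)
    w_ada, cache = adagrad(w_ada, cache, 2 * w_ada)
```

All three drive `w` toward the minimum at 0, but differently: SGD shrinks it geometrically, Momentum overshoots and oscillates before settling, and AdaGrad takes progressively smaller steps as the squared-gradient cache grows, which already hints at why tuning pressure differs across these methods.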