Dropout is similar to ensembling
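Conjugate gradients are akin to momentum: each step mixes the new gradient direction with the previous ones. That is quite different from my previous intuition that we "always move in a new direction", which was correct... up to the metric (successive directions are conjugate, i.e. orthogonal in the inner product induced by the Hessian, not in the Euclidean one).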
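To make the "up to the metric" remark concrete, here is a minimal sketch of linear conjugate gradients on a small quadratic. The matrix `A`, the vector `b`, the random seed, and the function name `conjugate_gradient` are all made up for illustration; none of them come from the talk. The direction update has the same shape as a momentum update, and the first two directions come out orthogonal in the `A` inner product but not in the Euclidean one.

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps):
    """Linear CG for minimizing 0.5 * x^T A x - b^T x (A symmetric positive definite)."""
    x = x0.astype(float)
    r = b - A @ x              # residual, i.e. the negative gradient at x
    d = r.copy()               # first direction is plain steepest descent
    directions = []
    for _ in range(n_steps):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad            # negative gradient at the new point
        beta = (r_new @ r_new) / (r @ r)  # weight given to the previous direction
        directions.append(d)
        d = r_new + beta * d              # momentum-like mix: new gradient + old direction
        r = r_new
    return x, directions

# Hypothetical test problem (an assumption, not from the source).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)    # symmetric positive definite by construction
b = rng.standard_normal(5)

x, dirs = conjugate_gradient(A, b, np.zeros(5), n_steps=5)
d0, d1 = dirs[0], dirs[1]
euclid = d0 @ d1               # Euclidean inner product of the first two directions
a_inner = d0 @ A @ d1          # inner product in the metric defined by A
print("Euclidean:", euclid, "A-metric:", a_inner)
```

The only difference from heavy-ball momentum in this sketch is that `alpha` and `beta` are recomputed at every step from the residuals instead of being fixed hyperparameters.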
Nesterov is about two things:
It is better to use the gradient at the destination point than the gradient at the origin
With large momentum, we get a meaningful enough estimate of where we will end up to compute that gradient
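The two points above can be sketched as a side-by-side comparison of classical and Nesterov momentum steps. This is only an illustration under assumptions: the quadratic objective behind `grad`, the learning rate, the momentum coefficient, and all function names are hypothetical choices, not something taken from the talk.

```python
import numpy as np

def grad(x):
    # Gradient of a hypothetical objective f(x) = 0.5 * (x[0]**2 + 25 * x[1]**2).
    return np.array([1.0, 25.0]) * x

def momentum_step(x, v, lr, mu):
    # Classical momentum: gradient evaluated at the origin of the step.
    v = mu * v - lr * grad(x)
    return x + v, v

def nesterov_step(x, v, lr, mu):
    # Nesterov: the momentum term alone already predicts roughly where we land,
    # so evaluate the gradient at that look-ahead point instead.
    lookahead = x + mu * v
    v = mu * v - lr * grad(lookahead)
    return x + v, v

def run(step, n_steps=300, lr=0.03, mu=0.9):
    x = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(n_steps):
        x, v = step(x, v, lr, mu)
    return x

x_momentum = run(momentum_step)
x_nesterov = run(nesterov_step)
print("classical:", x_momentum, "nesterov:", x_nesterov)
```

With `mu` close to 1 the look-ahead point `x + mu * v` is a good guess for the destination, which is why the second point matters; with small `mu` the two variants evaluate the gradient at almost the same place.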
Mentioned this blog post again
Again this bullshit about SGD being "incorrect". It's not that SGD is incorrect; it's that you define it the wrong way (while being aware that it should be done in a different way) and omit the explicit isomorphism.