SKDL19L2 takeaways

  • Dropout is similar to ensembling

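  • Conjugate gradients are akin to momentum: each step mixes the new direction with previous ones. That is quite different from my previous intuition that we "always move in a new direction", which was correct... up to the metric (successive directions are conjugate, i.e. orthogonal under the Hessian's inner product, not in the Euclidean sense).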

  • Nesterov is about two things:

    • It is better to use the gradient at the destination point than the gradient at the starting point

    • With large momentum, we have a good enough estimate of where we will end up to compute the gradient there

  • Mentioned this blog post again

  • Again this bullshit about SGD being "incorrect". It's not that SGD is incorrect; it's that you define it the wrong way (while being aware that it should be done in a different way) and omit the explicit isomorphism.
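
To make the Nesterov point above concrete, here is a minimal sketch in plain Python. The test function, the step size `lr`, the momentum coefficient `mu`, and the helper names `grad`, `momentum_step` and `run` are all my own assumptions for illustration, not something from the lecture. The only difference between the two variants is where the gradient is evaluated: at the current point (classical momentum) or at the point the momentum alone would carry us to (Nesterov).

```python
def grad(x, y):
    # Gradient of a hypothetical test function f(x, y) = 0.5 * (x**2 + 10 * y**2).
    return x, 10.0 * y


def momentum_step(pos, vel, lr, mu, lookahead):
    """One update of momentum SGD; lookahead=True gives the Nesterov variant."""
    x, y = pos
    vx, vy = vel
    if lookahead:
        # Nesterov: gradient at the estimated destination pos + mu * vel.
        gx, gy = grad(x + mu * vx, y + mu * vy)
    else:
        # Classical momentum: gradient at the starting point.
        gx, gy = grad(x, y)
    # Mix the previous velocity with the new gradient step.
    vx = mu * vx - lr * gx
    vy = mu * vy - lr * gy
    return (x + vx, y + vy), (vx, vy)


def run(lookahead, steps=200, lr=0.05, mu=0.9):
    # Start away from the minimum at (0, 0) with zero velocity.
    pos, vel = (1.0, 1.0), (0.0, 0.0)
    for _ in range(steps):
        pos, vel = momentum_step(pos, vel, lr, mu, lookahead)
    return pos


if __name__ == "__main__":
    print("classical:", run(lookahead=False))
    print("nesterov: ", run(lookahead=True))
```

With zero initial velocity the first step is identical in both variants, since the lookahead point coincides with the current one. The two only diverge once the velocity is non-zero, which matches the second sub-point: the lookahead is only informative when momentum dominates the step.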


