Thursday, December 28, 2023

Study Note: Stochastic Gradient Descent, Dueling network, Neural Architecture Search

Mini-batch Stochastic Gradient Descent:

  • Stochastic Gradient Descent with PyTorch:
    • Momentum (parameter momentum > 0) keeps a running average of past gradients, so each update is smoothed by the gradients that came before it.
    • Work with large datasets by updating on one mini-batch at a time - use a DataLoader with shuffle=True to randomly choose the data points used for each update (see the sketch below).
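
A minimal sketch of these two points. The toy model, data, and hyperparameter values are made up for illustration; only the use of momentum and a shuffling DataLoader comes from the note above:

    # Minimal sketch: mini-batch SGD with momentum in PyTorch.
    # Model, data, and hyperparameters are illustrative only.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy regression data, just to have something to iterate over.
    X = torch.randn(1000, 10)
    y = torch.randn(1000, 1)
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # shuffle -> random mini-batches

    model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum > 0 keeps past gradients
    loss_fn = nn.MSELoss()

    for epoch in range(5):
        for xb, yb in loader:              # one randomly sampled mini-batch per step
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()                # gradient of this mini-batch only
            optimizer.step()               # update smoothed by the momentum term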

Dueling network:

  • Write the Q function as the sum of two neural network outputs:
    • Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')
    • Note that if Q were just the sum of the two outputs, the decomposition would not be unique: moving a constant out of V(s) and into A(s, a) would leave Q unchanged, so the two heads could drift against each other and train unstably. The max_a' A(s, a') term prevents this, because shifting A by a constant also shifts the max, so such a shift would now change Q and is no longer "free". In practice the max term is often replaced with the mean over actions, mean_a' A(s, a'), which performs better (see the sketch below).
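
A minimal sketch of a dueling Q-network head using the mean-subtracted advantage; the class name, layer sizes, and state/action dimensions are arbitrary assumptions:

    # Minimal sketch of a dueling Q-network: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    import torch
    from torch import nn

    class DuelingQNet(nn.Module):
        def __init__(self, state_dim, num_actions):
            super().__init__()
            self.feature = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
            self.value = nn.Linear(64, 1)                # V(s): one scalar per state
            self.advantage = nn.Linear(64, num_actions)  # A(s, a): one value per action

        def forward(self, s):
            h = self.feature(s)
            v = self.value(h)                            # shape (batch, 1)
            a = self.advantage(h)                        # shape (batch, num_actions)
            # Subtracting the mean (instead of the max) pins down the V/A split
            # and tends to train more stably in practice.
            return v + a - a.mean(dim=1, keepdim=True)

    q = DuelingQNet(state_dim=4, num_actions=2)(torch.randn(8, 4))  # -> shape (8, 2)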

  

Neural Architecture Search

  • You have laid out a set of candidate neural network layers for the same purpose, or maybe the same kind of layer with different choices of hyperparameters - should this layer use 20 neurons or 30? Would a ResNet block work better than a plain fully connected layer for this part of the network? The goal of neural architecture search is to find a better configuration.
    • Naive approach: give every configuration a try. If there are 3 candidate layers, train a separate network for each one. This gets out of hand quickly when several such choices are connected in series - the numbers of choices multiply.
    • Differentiable Neural Architecture Search:
      • Construct a super net by connecting all of the candidate layers you want to try in parallel inside a single network, and combine their outputs with a weighted sum whose weights come from a softmax over learnable parameters. Then train the super net: every candidate layer is trained jointly, and its importance shows up in the learned weights. The candidate with the largest weight is the winner - keep that layer and throw away the ones with smaller weights (see the sketch after this list).
      • This was explained in this video (in Chinese). It is also possible to add a running-time term to the weighting so that the choice made by this process considers both accuracy and speed.
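
A minimal sketch of one such weighted-sum block; the MixedLayer class, the particular candidates, and the sizes are illustrative assumptions, not taken from the video:

    # Minimal sketch of a differentiable-NAS "super net" block: candidate layers run
    # in parallel and are mixed by a softmax over learnable architecture weights.
    import torch
    from torch import nn
    import torch.nn.functional as F

    class MixedLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.candidates = nn.ModuleList([
                nn.Linear(dim, dim),                                                  # plain fully connected
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)),   # deeper block
                nn.Identity(),                                                        # skip connection
            ])
            # One learnable architecture weight per candidate.
            self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

        def forward(self, x):
            w = F.softmax(self.alpha, dim=0)   # relative importance of each candidate
            return sum(wi * layer(x) for wi, layer in zip(w, self.candidates))

        def best(self):
            # After training, keep only the candidate with the largest weight.
            return self.candidates[int(self.alpha.argmax())]

    block = MixedLayer(16)
    out = block(torch.randn(8, 16))  # train as usual; afterwards block.best() picks the winner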


 
