Mini-batch Stochastic Gradient Descent:
- Stochastic Gradient Descent with PyTorch:
- keep a running average of past gradients (set the momentum parameter > 0)
- handle large datasets by training on mini-batches - use a DataLoader to randomly draw the datapoints for each update
- PyTorch's SGD doesn't actually do the stochastic part, despite the name; the randomness comes from how the DataLoader samples the data (see the sketch after this list)
- Sample datapoints with a higher loss at a higher probability.
- Correct the resulting sampling bias by scaling the learning rate: datapoints that are sampled with a higher probability (higher loss) get a smaller learning rate, and rarely sampled ones get a larger one, as suggested in this RL: Experience Replay video (in Chinese).
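A minimal PyTorch sketch of the points above, assuming a toy dataset and model made up for illustration: the DataLoader supplies the random mini-batches, torch.optim.SGD with momentum > 0 keeps the running average of gradients, and a WeightedRandomSampler is one possible way to sample high-loss datapoints more often.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy regression data and model (hypothetical, just to make the sketch runnable).
x = torch.randn(1000, 8)
y = torch.randn(1000, 1)
dataset = TensorDataset(x, y)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

# The "stochastic" part lives in the DataLoader: shuffle=True makes it hand out
# randomly drawn mini-batches. torch.optim.SGD itself just applies the update
# (with momentum keeping a running average of past gradients).
loader = DataLoader(dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# One possible way (an assumption, not the only option) to sample high-loss
# datapoints with higher probability: use the current per-sample losses as
# sampling weights for a WeightedRandomSampler.
with torch.no_grad():
    per_sample_loss = ((model(x) - y) ** 2).mean(dim=1)
sampler = WeightedRandomSampler(weights=per_sample_loss, num_samples=len(dataset))
prioritized_loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # used like `loader`
```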
Dueling network:
- Write the Q* function as the sum of two neural network outputs:
- Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')
- Note that since Q is the sum of two network outputs, the decomposition is not unique for a given Q value: training could drift by shifting a constant out of one network's output and adding it to the other's. The max_a' A(s, a') term prevents this: if a constant is moved from V into A, the max term shifts by that same constant, so Q changes and the shift is no longer free. In practice the max term is often replaced with mean_a' A(s, a') for better performance.
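A minimal sketch of a dueling head, assuming a discrete action space; layer sizes and names are arbitrary. Both the max variant and the mean variant used in practice are shown.

```python
import torch
from torch import nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')  (or mean instead of max)."""

    def __init__(self, state_dim: int, num_actions: int, use_mean: bool = True):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)                 # V(s): one scalar per state
        self.advantage = nn.Linear(128, num_actions)   # A(s, a): one value per action
        self.use_mean = use_mean

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value(h)      # shape (batch, 1)
        a = self.advantage(h)  # shape (batch, num_actions)
        if self.use_mean:
            # The mean version, commonly preferred in practice.
            return v + a - a.mean(dim=1, keepdim=True)
        return v + a - a.max(dim=1, keepdim=True).values

# Usage: q = DuelingQNet(state_dim=4, num_actions=2)(torch.randn(32, 4))  # shape (32, 2)
```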
Neural Architecture Search:
- You have laid out a set of candidate neural network layers for the same purpose, or maybe the same layer with different numbers of parameters - should it use 20 neurons or 30? Would a ResNet block work better than a plain fully connected layer for one part of the network? etc. The goal of neural architecture search is to find the better configuration.
- Naive approach: try every configuration. If there are 3 candidate layers, train one model per configuration. This gets out of hand quickly when several such choices are connected in series - the number of configurations multiplies.
- Differentiable Neural Architecture Search:
- Construct a super net by connecting all the candidate layers you want to try in parallel inside a single network, and combine their outputs with a weighted sum whose weights come from a softmax over learnable parameters. Then train the super net: every candidate layer gets trained, and its importance shows up in the learned weight. The candidate with the largest weight is the winner - keep that layer and throw away the ones with smaller weights (see the sketch after this list).
- This is explained in this video (in Chinese). It is also possible to add a running-time (latency) term to the objective so that the search considers both accuracy and speed.
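A minimal sketch of one position in such a super net, with three hypothetical candidate layers. The architecture weights (alpha) are learnable parameters trained together with the normal weights; after training, the candidate with the largest weight is kept.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MixedLayer(nn.Module):
    """One slot in the super net: run all candidates in parallel and combine
    their outputs with softmax-weighted coefficients."""

    def __init__(self, dim: int = 32):
        super().__init__()
        # Hypothetical candidates for this slot; all map dim -> dim.
        self.candidates = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim)),
            nn.Identity(),
        ])
        # One learnable architecture weight per candidate.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def winner(self) -> nn.Module:
        # After training, keep only the candidate with the largest weight.
        return self.candidates[int(self.alpha.argmax())]
```

To make the search latency-aware, one option is to add a penalty to the training loss, for example the softmax-weighted sum of each candidate's measured runtime, so the winning candidate trades off accuracy against speed.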