Mini-batch Stochastic Gradient Descent:
- Stochastic Gradient Descent with PyTorch:
- keep a running average of past gradients (set the momentum parameter > 0)
- handle large datasets by training on mini-batches - use a DataLoader to randomly draw the datapoints for each update
- PyTorch's SGD doesn't actually do the stochastic part, despite the name; the randomness comes from how the DataLoader samples the data (see the sketch after this list)
- Sample datapoints with a higher loss at a higher probability.
- Correct the resulting sampling bias by scaling the learning rate: datapoints that are sampled with a higher probability (higher loss) get a smaller learning rate, and rarely sampled ones get a larger one, as suggested in this RL: Experience Replay video (in Chinese).
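A minimal PyTorch sketch of the points above, assuming a toy dataset and model made up for illustration: the DataLoader supplies the random mini-batches, torch.optim.SGD with momentum > 0 keeps the running average of gradients, and a WeightedRandomSampler is one possible way to sample high-loss datapoints more often.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy regression data and model (hypothetical, just to make the sketch runnable).
x = torch.randn(1000, 8)
y = torch.randn(1000, 1)
dataset = TensorDataset(x, y)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

# The "stochastic" part lives in the DataLoader: shuffle=True makes it hand out
# randomly drawn mini-batches. torch.optim.SGD itself just applies the update
# (with momentum keeping a running average of past gradients).
loader = DataLoader(dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# One possible way (an assumption, not the only option) to sample high-loss
# datapoints with higher probability: use the current per-sample losses as
# sampling weights for a WeightedRandomSampler.
with torch.no_grad():
    per_sample_loss = ((model(x) - y) ** 2).mean(dim=1)
sampler = WeightedRandomSampler(weights=per_sample_loss, num_samples=len(dataset))
prioritized_loader = DataLoader(dataset, batch_size=64, sampler=sampler)  # used like `loader`
```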
Dueling network:
- Write the Q* function as the sum of two neural network outputs:
- Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')
- Note that since Q is the sum of two network outputs, the decomposition is not unique for a given Q value: training could drift by shifting a constant out of one network's output and adding it to the other's. The max_a' A(s, a') term prevents this: if a constant is moved from V into A, the max term shifts by that same constant, so Q changes and the shift is no longer free. In practice the max term is often replaced with mean_a' A(s, a') for better performance.
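A minimal sketch of a dueling head, assuming a discrete action space; layer sizes and names are arbitrary. Both the max variant and the mean variant used in practice are shown.

```python
import torch
from torch import nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')  (or mean instead of max)."""

    def __init__(self, state_dim: int, num_actions: int, use_mean: bool = True):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)                 # V(s): one scalar per state
        self.advantage = nn.Linear(128, num_actions)   # A(s, a): one value per action
        self.use_mean = use_mean

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.features(state)
        v = self.value(h)      # shape (batch, 1)
        a = self.advantage(h)  # shape (batch, num_actions)
        if self.use_mean:
            # The mean version, commonly preferred in practice.
            return v + a - a.mean(dim=1, keepdim=True)
        return v + a - a.max(dim=1, keepdim=True).values

# Usage: q = DuelingQNet(state_dim=4, num_actions=2)(torch.randn(32, 4))  # shape (32, 2)
```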
Neural Architecture Search:
- You have laid out a set of candidate neural network layers for the same purpose, or maybe the same layer with different numbers of parameters - should it use 20 neurons or 30? Would a ResNet block work better than a plain fully connected layer for one part of the network? etc. The goal of neural architecture search is to find the better configuration.
- Naive approach: try every configuration. If there are 3 candidate layers, train one model per configuration. This gets out of hand quickly when several such choices are connected in series - the number of configurations multiplies.
- Differentiable Neural Architecture Search:
- Construct a super net by connecting all the candidate layers you want to try in parallel inside a single network, and combine their outputs with a weighted sum whose weights come from a softmax over learnable parameters. Then train the super net: every candidate layer gets trained, and its importance shows up in the learned weight. The candidate with the largest weight is the winner - keep that layer and throw away the ones with smaller weights (see the sketch after this list).
- This is explained in this video (in Chinese). It is also possible to add a running-time (latency) term to the objective so that the search considers both accuracy and speed.
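A minimal sketch of one position in such a super net, with three hypothetical candidate layers. The architecture weights (alpha) are learnable parameters trained together with the normal weights; after training, the candidate with the largest weight is kept.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MixedLayer(nn.Module):
    """One slot in the super net: run all candidates in parallel and combine
    their outputs with softmax-weighted coefficients."""

    def __init__(self, dim: int = 32):
        super().__init__()
        # Hypothetical candidates for this slot; all map dim -> dim.
        self.candidates = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim)),
            nn.Identity(),
        ])
        # One learnable architecture weight per candidate.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def winner(self) -> nn.Module:
        # After training, keep only the candidate with the largest weight.
        return self.candidates[int(self.alpha.argmax())]
```

To make the search latency-aware, one option is to add a penalty to the training loss, for example the softmax-weighted sum of each candidate's measured runtime, so the winning candidate trades off accuracy against speed.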