Thursday, December 28, 2023

Study Note: Stochastic Gradient Descent, Dueling network, Neural Architecture Search

Mini-batch Stochastic Gradient Descent:

  • Stochastic Gradient Descent with PyTorch:
    • keep a decaying average of past gradients (set the parameter momentum > 0)
    • handle large datasets by updating on one mini-batch at a time - use DataLoader to randomly sample the datapoints for each update (a minimal sketch follows this list).
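A minimal sketch of both points in PyTorch; the toy dataset, model shape, and hyperparameters below are illustrative assumptions, not a specific recipe:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data, assumed only for the example.
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
# shuffle=True makes DataLoader sample random mini-batches each epoch.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
# momentum > 0 keeps a decaying average of past gradients in the optimizer state.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:              # each step sees one mini-batch only
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                # gradient of the mini-batch loss
        optimizer.step()               # momentum-smoothed SGD update
```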

Dueling network:

  • Write the Q function as the sum of two neural network outputs:
    • Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')
    • Note that since Q is the sum of two network outputs, the split is not unique by itself: for the same Q value, the two networks could drift, shifting a constant out of one output and adding it to the other. The max_a' A(s, a') term prevents this, because any such shift also changes the max term and therefore changes Q. In practice the max term is often replaced with the mean of A(s, a) over actions for better performance (see the sketch after this list).
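A minimal sketch of a dueling Q-network head, assuming a small fully connected feature extractor and a discrete action space (layer sizes are illustrative), using the mean-of-A variant:

```python
import torch
from torch import nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)               # V(s)
        self.advantage = nn.Linear(64, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.features(s)
        v = self.value(h)                           # shape (batch, 1)
        a = self.advantage(h)                       # shape (batch, n_actions)
        # Subtracting the mean (or max) of A over actions pins down the V/A
        # split, so the two heads cannot drift by trading a constant offset.
        return v + a - a.mean(dim=1, keepdim=True)

q_net = DuelingQNet(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(8, 4))                 # (8, 2) Q-values
```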

  

Neural Architecture Search

  • You have laid out a set of candidate layers for the same purpose, or perhaps the same layer with different hyperparameter choices - should I use 20 neurons or 30? Would a ResNet block work better than a plain fully connected layer for one part of the network? The goal of neural architecture search is to find a better configuration.
    • Naive approach: try every configuration. If there are 3 candidate layers, train one model per configuration. This gets out of hand quickly when several such choices are connected in series - the numbers of choices multiply.
    • Differentiable Neural Architecture Search:
      • Construct a supernet by connecting all candidate layers you want to try in parallel inside a single network, and combine their outputs with a weighted sum whose weights come from a softmax over learnable parameters. Then train the supernet: every candidate layer gets trained, and its importance shows up in the learned weights. The candidate with the largest weight is the winner - keep that layer and throw away the ones with smaller weights (see the sketch after this list).
      • It was explained in this video (Chinese). It is also possible to add a running-time term to the weights so that the choice made by this process considers both accuracy and speed.
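A minimal sketch of one "mixed" layer of such a supernet; the candidate layers and sizes are illustrative assumptions, not a specific published search space:

```python
import torch
from torch import nn
import torch.nn.functional as F

class MixedLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Candidate layers competing for the same slot in the network.
        self.candidates = nn.ModuleList([
            nn.Linear(dim, dim),                                  # plain fully connected
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, dim)),                   # deeper option
            nn.Identity(),                                        # skip connection
        ])
        # One learnable architecture weight per candidate.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)   # softmax turns alpha into importance weights
        return sum(w_i * layer(x) for w_i, layer in zip(w, self.candidates))

layer = MixedLayer(dim=16)
out = layer(torch.randn(4, 16))
# After training the supernet, keep only the candidate with the largest weight.
winner = int(torch.argmax(layer.alpha))
```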


 

Study note: Bipartite Graph

14-1 Bipartite Graphs

  • A bipartite graph is one whose nodes split into two sets. Edges can exist between the sets, but not within a set.
  • Algorithm to check whether a graph is bipartite: using 2 colors, traverse the nodes, color each node, and color its neighbors with the other color. If a node's color conflicts with an earlier coloring decision, the graph is not bipartite (see the sketch below).
  • Define Matching: a pairing of nodes from the two sets of a bipartite graph in which no node is reused (two nodes cannot share the same partner node).
  • Max-cardinality Bipartite Matching - include as many pairs as possible.
    • Greedy algorithm: iterate through the nodes on one side and find a partner for each (skip a node if no partner can be found).
    • Convert the matching problem into a network flow problem
      • by connecting a source node to one set and a sink node to the other set, then finding the flow that can be pushed from source to sink
        • assign each edge a capacity of 1.
      • Solve the network flow problem with the Ford-Fulkerson algorithm; the resulting paths give the optimal matching (sketched below).
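A minimal sketch of both ideas above - the 2-coloring test and max-cardinality matching by repeatedly finding augmenting paths (equivalent to Ford-Fulkerson on the unit-capacity network) - assuming the graph is given as plain adjacency dicts:

```python
from collections import deque

def is_bipartite(adj):
    """adj: dict node -> list of neighbours (undirected, both directions present)."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]      # neighbours get the other color
                    queue.append(v)
                elif color[v] == color[u]:       # conflict with an earlier decision
                    return False
    return True

def max_matching(left, adj):
    """left: nodes of one side; adj: dict left node -> list of right nodes."""
    match_right = {}                             # right node -> its matched left node

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-routed elsewhere.
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matched = 0
    for u in left:                               # one unit of flow per left node
        if try_augment(u, set()):
            matched += 1
    return matched, match_right

print(is_bipartite({"a": ["x"], "x": ["a", "b"], "b": ["x"]}))   # True
print(max_matching(["a", "b"], {"a": ["x", "y"], "b": ["x"]}))   # (2, {'x': 'b', 'y': 'a'})
```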

  • Let the edges between the two sets of nodes form a matrix, where each edge has a weight. The goal is to find the matching with the maximum total weight.
    • The Hungarian Algorithm solves this; in its basic form it finds the minimum total weight.
      • prerequisite: the two sets must have the same cardinality, i.e., the same number of nodes.
      • Time complexity: O(n^3)
      • Negating the weights converts the algorithm from finding the minimum total weight to finding the maximum total weight (see the sketch below).
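A minimal sketch using SciPy's linear_sum_assignment, which solves this assignment problem (current SciPy uses a Jonker-Volgenant-style solver rather than the classical Hungarian algorithm, but the interface is the same); the square weight matrix is illustrative, and negating it turns the built-in minimization into the maximization described above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# weights[i][j]: weight of the edge between left node i and right node j.
weights = np.array([[4.0, 1.0, 3.0],
                    [2.0, 0.0, 5.0],
                    [3.0, 2.0, 2.0]])

rows, cols = linear_sum_assignment(-weights)    # minimise negated weights == maximise weights
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(pairs)                                    # [(0, 0), (1, 2), (2, 1)]
print(weights[rows, cols].sum())                # 11.0, the maximum total weight
```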