Wednesday, February 21, 2024

Summary of Shusen Wang's video: Multi-gate Mixture-of-Experts (MMoE)

The idea is to train multiple expert neural networks and take a weighted sum of their outputs. The weights are produced by another neural network (a gate) that takes the same input and ends in a softmax layer. The weighted-sum vector is then fed into a further neural network layer to predict a regression task.
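A minimal sketch of that single-gate setup in PyTorch (the layer sizes, number of experts, and scalar regression head are illustrative assumptions, not taken from the video):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """One gate mixing several expert networks, followed by a task tower."""
    def __init__(self, in_dim=32, expert_dim=16, n_experts=4):
        super().__init__()
        # Expert networks: each maps the shared input to an expert_dim vector.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)]
        )
        # Gate: same input, softmax output gives one weight per expert.
        self.gate = nn.Sequential(nn.Linear(in_dim, n_experts), nn.Softmax(dim=-1))
        # Tower: consumes the weighted sum and predicts a scalar (regression).
        self.tower = nn.Linear(expert_dim, 1)

    def forward(self, x):
        weights = self.gate(x)                                      # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, n_experts, expert_dim)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)     # weighted sum of expert outputs
        return self.tower(mixed)

x = torch.randn(8, 32)
print(MoELayer()(x).shape)  # torch.Size([8, 1])
```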

When multiple predictions (targets) are made, they share the same expert networks, but each target has its own gating network for weight generation and its own downstream network that processes the weighted-sum vector.
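A hedged sketch of that multi-gate version: the experts are shared, while each task gets its own gate and tower (the two-task setup and all dimensions are hypothetical):

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Shared experts; each task has its own gate and tower."""
    def __init__(self, in_dim=32, expert_dim=16, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)]
        )
        self.gates = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, n_experts), nn.Softmax(dim=-1))
             for _ in range(n_tasks)]
        )
        self.towers = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(n_tasks)]
        )

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, dim)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = gate(x).unsqueeze(-1)             # task-specific weights (batch, experts, 1)
            mixed = (w * expert_out).sum(dim=1)   # task-specific weighted sum
            outputs.append(tower(mixed))
        return outputs

x = torch.randn(8, 32)
print([o.shape for o in MMoE()(x)])  # [torch.Size([8, 1]), torch.Size([8, 1])]
```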

When one weight is close to 1 and the other weights are close to 0, only one expert network does any real work and the remaining experts are effectively skipped. This polarization is an undesirable outcome that wastes model capacity. To avoid polarization in the weights, a dropout operation is applied to the softmax output during training, giving each expert's weight roughly a 10% chance of being dropped. A model that puts all of its logic into a single expert then performs poorly whenever that expert is dropped, which in turn encourages multiple experts to be trained at the same time.
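One possible way to implement that dropout on the gate output is to zero each expert's softmax weight with some probability during training and renormalize; this is a sketch of the idea described above, and the exact mechanics in the video may differ:

```python
import torch

def drop_expert_weights(weights, drop_prob=0.1, training=True):
    """Zero each expert's softmax weight with probability drop_prob, then renormalize."""
    if not training or drop_prob == 0.0:
        return weights
    mask = (torch.rand_like(weights) > drop_prob).float()   # 1 = keep, 0 = drop
    dropped = weights * mask
    # Renormalize so the kept weights still sum to 1 (guard against all-dropped rows).
    denom = dropped.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return dropped / denom

w = torch.softmax(torch.randn(4, 8), dim=-1)  # softmax output from a gate
print(drop_expert_weights(w).sum(dim=-1))     # rows still sum to ~1 (or 0 if all were dropped)
```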

Reference: https://www.youtube.com/watch?v=JIEwaPARjfk
