It's been a while since my last update. Today let's talk about MoE (Mixture-of-Experts), a technique that has existed for a long time, has only recently begun appearing in mainstream models one after another, and is in fact quite important for improving model capability. Reference papers:
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Mixtral of Experts
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Adaptive Mixtures of Local Experts
A Review of Sparse Expert Models in Deep Learning
Unified Scaling Laws for Routed Language Models