SOTA Attention Mechanism and Activation Functions on XLNet

We re-implemented XLNet, a state-of-the-art transformer model, from scratch and experimented with SOTA activation functions, including GELU and Mish, as well as the Attention on Attention (AoA) mechanism. We analyzed the effect of these techniques on our XLNet model by evaluating its pretraining behavior. We found that Mish improves training by smoothing the learning curve, and that AoA improves performance by building a stronger relationship between the query and the traditional attention result. We then added these building blocks to the original XLNet model to see whether the positive effects generalize to a larger XLNet. We pretrained, finetuned, and evaluated the model on SQuAD 2.0, and concluded that Mish and AoA benefit XLNet's performance, especially when computing power is limited.
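
For reference, below is a minimal PyTorch sketch of the two building blocks described above: the Mish activation (x * tanh(softplus(x))) and an AoA layer that gates the attended vector with the query. The module names (`Mish`, `AttentionOnAttention`) and the `d_model` parameter are illustrative assumptions, not code from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: x * tanh(softplus(x))."""
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class AttentionOnAttention(nn.Module):
    """Attention on Attention: combines the query with the attended
    vector via an information branch and a sigmoid gate."""
    def __init__(self, d_model):
        super().__init__()
        # Both branches are conditioned on [query; attended value]
        self.info = nn.Linear(2 * d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, query, attended):
        # query, attended: (batch, seq_len, d_model)
        qv = torch.cat([query, attended], dim=-1)
        i = self.info(qv)                 # information vector
        g = torch.sigmoid(self.gate(qv))  # attention gate
        return i * g                      # element-wise gated output

# Usage sketch: gate the output of a standard attention layer
aoa = AttentionOnAttention(d_model=64)
q = torch.randn(2, 16, 64)          # query
attn_out = torch.randn(2, 16, 64)   # output of a traditional attention layer
out = aoa(q, attn_out)              # (2, 16, 64)
```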