I extended the BiDAF model with various optimization techniques on the SQuAD 2.0 dataset. With character embeddings and multi-head self-attention added to the model, my results show an improvement of +4 points on EM and +4 points on F1 compared with the default project baseline. The performance is as expected, but there is still room for improvement. One notable finding is that I could also generate a mask for each word during training that forces the attention computation to focus not on the current word itself but on the other words of the given input. Right after completing the project report, I noticed other work reporting that pure self-attention is not that helpful on its own because of rank collapse: it seems a stack of pure self-attention layers can effectively be converted into a shallow network.
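
For illustration, below is a minimal sketch of the masking idea described above, assuming a standard scaled dot-product self-attention in PyTorch; the function name and tensor shapes are hypothetical, not the exact implementation used in this project.

```python
import torch
import torch.nn.functional as F

def self_attention_excluding_self(q, k, v):
    """Scaled dot-product self-attention where each position is masked so it
    cannot attend to itself, only to the other words of the input.

    q, k, v: tensors of shape (batch, seq_len, d_model).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (batch, seq_len, seq_len)

    # Mask the diagonal so position i ignores word i and attends to other words.
    seq_len = scores.size(-1)
    diag_mask = torch.eye(seq_len, dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(diag_mask, float("-inf"))

    attn = F.softmax(scores, dim=-1)                  # weights over the other words
    return attn @ v
```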