Exploring Improvements to the SQuAD 2.0 BiDAF Model

We explored deep-learning approaches to question answering on SQuAD 2.0 by extending the BiDAF model. Our baseline, provided by the default project starter code, is a modified BiDAF that uses only word-level embeddings and runs on SQuAD 2.0. We investigated three improvements: character embeddings, conditioning the end prediction on the start prediction, and adding a self-attention layer. The biggest improvement came from the model with end prediction conditioned on start prediction and self-attention, which achieved F1 and EM scores of 65.285 and 61.758, respectively, on the test set. On the dev set, the model with character embeddings scored 59.96 EM and 63.24 F1, and the model with character embeddings and self-attention scored 63 EM and 66.2 F1. In our error analysis, we found that all models generally performed well on questions beginning with "When" and poorly on questions beginning with "What" and "The". Future work includes investigating how further extensions, such as transformers, co-attention, and different input features, affect performance. Overall, this project was very educational: it led us to read numerous papers describing breakthrough improvements on this problem and to implement the methods from those papers ourselves.
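
The error analysis by question-initial word could be reproduced along the following lines. This is a minimal sketch, not the code used in the project: the function name `exact_match_by_first_word` and the `(question, prediction, gold_answers)` tuple layout are assumptions made for illustration, an empty gold list is taken to mean an unanswerable question, and the string comparison only lowercases and trims whitespace, which is simpler than the normalization in the official SQuAD evaluation script.

```python
from collections import defaultdict


def exact_match_by_first_word(examples):
    """Return the exact-match rate per question-initial word.

    `examples` is an iterable of (question, prediction, gold_answers)
    tuples. This layout is hypothetical, chosen for illustration.
    An empty `gold_answers` list stands for an unanswerable question,
    for which the correct prediction is the empty string.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question, prediction, gold_answers in examples:
        words = question.split()
        # Bucket by the lowercased first word, e.g. "when", "what", "the".
        key = words[0].lower() if words else ""
        totals[key] += 1
        # Simplified matching: lowercase and trim only (an assumption;
        # the official script also strips punctuation and articles).
        golds = {g.strip().lower() for g in gold_answers} or {""}
        if prediction.strip().lower() in golds:
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}
```

Sorting the returned dictionary by value gives a quick view of which question types a model handles best and worst; buckets with very few questions should be read with caution.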