In this project, we explored different techniques in the encoding layer, the attention layer, and the output layer of an end-to-end neural network architecture for question answering. Experimental results show that performance can be improved with several enhancements on top of the baseline model. In particular, with additional character embeddings and deep residual coattention, we achieve an EM of 61.17 and an F1 of 64.97, compared to the EM of 58.32 and F1 of 61.78 of the baseline BiDAF model. To better understand the behavior of the best-performing model, we broke down the F1 score distribution on the development set and examined performance across different context lengths, answer lengths, and question types. Furthermore, by inspecting some of the error examples, we found that the model performs poorly mainly on questions that require reasoning or involve complex sentence structures.