Building a QA system (Robust QA track)

While great strides have been made in solving fundamental NLP tasks, the models that tackle these problems often fail to generalize to data from outside the training distribution. This is problematic, since real-world applications require models to adapt to inputs drawn from previously unseen distributions. In this paper, we discuss our attempt to create a robust system for extractive question answering (QA). We use a BERT variant as our baseline and attempt four methods to improve upon it. Our first method is a model that uses the Mixture-of-Experts (MoE) technique described in "Adaptive Mixtures of Local Experts" and the Robust QA Default Project handout. The second is an original inference-time procedure that predicts the answer span maximizing the expected F1 score. The third is data augmentation, which produces additional out-of-domain training examples. Our final and best-performing method is Domain Adversarial Training, described in "Domain-agnostic Question-Answering with Adversarial Training". The MoE model and the expected-F1-maximization strategy fail to outperform the baseline's F1 score of 47.098, achieving validation F1 scores of 44.870 and 44.706, respectively. Training the baseline on augmented data produces an F1 score of 48.04. Domain Adversarial Training gives the best results when coupled with data augmentation, yielding an F1 score of 51.17 on the validation set. However, on the test set, none of our models were able to beat the baseline's F1 score of 60.240.
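
To make the second method concrete, below is a minimal sketch of what an expected-F1-maximizing decoder could look like at inference time. The helper names, the factorized span distribution P(s, e) = p_start(s) · p_end(e), and the top-k candidate truncation are illustrative assumptions, not a specification of the exact procedure used in our experiments.

```python
def token_f1(pred_span, gold_span):
    """Token-overlap F1 between two (start, end) spans, inclusive on both ends."""
    pred = set(range(pred_span[0], pred_span[1] + 1))
    gold = set(range(gold_span[0], gold_span[1] + 1))
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def expected_f1_span(start_probs, end_probs, max_answer_len=30, top_k=50):
    """Pick the span that maximizes expected F1 under the model's span distribution.

    start_probs / end_probs: per-token probabilities over context positions.
    Assumes the span probability factorizes as P(s, e) = start_probs[s] * end_probs[e].
    """
    n = len(start_probs)
    # Enumerate candidate spans up to a maximum answer length.
    spans = [(s, e) for s in range(n) for e in range(s, min(s + max_answer_len, n))]
    # Keep only the top-k most probable spans so the O(k^2) scoring step stays cheap.
    spans = sorted(spans, key=lambda se: start_probs[se[0]] * end_probs[se[1]], reverse=True)[:top_k]
    probs = [start_probs[s] * end_probs[e] for s, e in spans]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Score each candidate by its expected F1 when the "gold" span is drawn from P.
    best_span, best_score = None, -1.0
    for cand in spans:
        exp_f1 = sum(p * token_f1(cand, gold) for gold, p in zip(spans, probs))
        if exp_f1 > best_score:
            best_span, best_score = cand, exp_f1
    return best_span
```

In practice, start_probs and end_probs would come from softmaxing the QA head's start and end logits over the context tokens; the key difference from standard decoding is that the prediction optimizes expected token-overlap F1 rather than the single most probable span.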