Pretraining of Transformers on Question Answering without External Data
Can recent Transformer-based pretraining approaches perform effectively on question answering without external data or large computational resources? We find that an ELECTRA-style pretraining objective can significantly reduce the computational cost of pretraining, and that the train-test discrepancy can be reduced by using a small vocabulary and question augmentation. Together, these methods raise the F1 score of a Transformer model on the SQuAD 2.0 development set from far below 52.2 to just over 60.4. However, the model relies mostly on textual similarity between the question and context, rather than on language understanding, to predict answers, and it still performs worse than a baseline BiDAF model. This suggests that the ability of current state-of-the-art training objectives and model architectures to learn effectively from limited data remains severely lacking. We hope that future methods, even with a general model architecture and objective, will perform well in this low-resource setting, and that such methods will learn more quickly, effectively, and generally by capturing patterns, rather than correlations, that reflect the meaning of language.
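To make the "ELECTRA-style" objective mentioned above concrete, the sketch below shows the core idea of replaced-token detection: a discriminator receives a corrupted sequence and predicts, at every position, whether the token was replaced, so all positions contribute to the loss rather than only the masked ~15% as in standard MLM. This is a minimal illustration under assumed hyperparameters and module names (TokenDiscriminator, electra_style_loss are hypothetical), not the exact configuration used in this work, and it substitutes random corruption for a learned generator.

```python
# Minimal sketch of an ELECTRA-style replaced-token-detection objective.
# Hyperparameters, module names, and the random-corruption stand-in for a
# generator are illustrative assumptions, not the setup used in this paper.
import torch
import torch.nn as nn

class TokenDiscriminator(nn.Module):
    """Small Transformer encoder with a per-token binary 'was this token replaced?' head."""
    def __init__(self, vocab_size=10_000, hidden=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(hidden, n_heads,
                                           dim_feedforward=4 * hidden,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(hidden, 1)  # one "replaced?" logit per position

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        return self.head(self.encoder(x)).squeeze(-1)  # (batch, seq_len) logits

def electra_style_loss(discriminator, original_ids, corrupted_ids):
    """Every position gets a training signal (replaced vs. original token)."""
    labels = (corrupted_ids != original_ids).float()
    logits = discriminator(corrupted_ids)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

# Toy usage: randomly corrupt ~15% of tokens to stand in for a generator's samples.
vocab_size, batch, seq_len = 10_000, 2, 32
original = torch.randint(0, vocab_size, (batch, seq_len))
corrupted = original.clone()
replace_mask = torch.rand(batch, seq_len) < 0.15
corrupted[replace_mask] = torch.randint(0, vocab_size, (int(replace_mask.sum()),))
loss = electra_style_loss(TokenDiscriminator(vocab_size), original, corrupted)
```

Because the binary detection task is defined over all positions and needs no large output softmax, a sketch like this conveys why such an objective can be cheaper per update than masked language modeling at the same model size.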