Reformed QANet - Optimizing the Spatial Complexity of QANet
The feed-forward QANet architecture replaced the bidirectional LSTMs of traditional question answering models with encoder blocks that combine convolution and self-attention, increasing the speed of the model without sacrificing accuracy. We achieved scores of 64.5 EM / 67.9 F1 on the dev set and 61.64 EM / 65.30 F1 on the test set. While the parallel nature of QANet's CNN architecture allows for a significant speed boost, realizing that benefit requires keeping GPU memory usage to a minimum. In this report we perform an exhaustive study of the spatial complexity, speed, and performance of the QANet architecture when components of the encoder block are replaced with memory-efficient alternatives such as LSH self-attention, reversible residual networks, and reformer blocks. In the modified encoder block, the self-attention and feed-forward layers are replaced with a reformer: a stack of reversible LSH self-attention and feed-forward layers. We found that implementing LSH attention successfully decreased memory usage on long sequences while maintaining reasonable performance. The other modifications did not quite match the original QANet model's EM and F1 scores, but they significantly decreased GPU memory usage. Additionally, we enriched the training data through back-translation data augmentation and found slight improvements on our larger model.
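To make the reversible pairing of an attention sublayer and a feed-forward sublayer concrete, the sketch below shows a minimal PyTorch implementation of a reformer-style encoder sublayer. This is not the authors' code: the class name ReversibleEncoderSublayer, the dimensions, and the use of nn.MultiheadAttention as a stand-in for LSH self-attention are all illustrative assumptions. A full Reformer layer would substitute LSH attention for the standard attention module and wire the inverse() computation into a custom autograd function so that activations never need to be stored during the backward pass.

```python
import torch
import torch.nn as nn


class ReversibleEncoderSublayer(nn.Module):
    """Reversible residual pairing of an attention sublayer F and a
    feed-forward sublayer G: y1 = x1 + F(x2), y2 = x2 + G(y1).
    Because the inputs can be reconstructed from the outputs, the
    intermediate activations do not have to be kept for backprop."""

    def __init__(self, d_model: int, n_heads: int = 8, d_ff: int = 512):
        super().__init__()
        # Stand-in for LSH self-attention (assumption: ordinary attention here).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)

    def f(self, x):
        # Attention sublayer F.
        h = self.attn_norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return out

    def g(self, x):
        # Feed-forward sublayer G.
        return self.ffn(self.ffn_norm(x))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    @torch.no_grad()
    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs; this reconstruction is what
        # lets a reversible network avoid storing per-layer activations.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2


if __name__ == "__main__":
    d_model = 128
    block = ReversibleEncoderSublayer(d_model)
    x = torch.randn(2, 400, d_model)   # (batch, context length, hidden size)
    x1, x2 = x, x                      # duplicate the input stream, RevNet-style
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))
```

The usage stub at the bottom simply checks that inverse() recovers the block's inputs from its outputs, which is the property that makes the reversible stack memory-efficient.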