Better Learning with Lesser Data: Meta-Learning with DistilBERT

While pre-trained transformer models have shown great success in recent years, they require a large amount of task-specific data for fine-tuning. In our project, we experimented with a variant of the MAML algorithm, namely Reptile, in a low-resource QA setting.

In contrast to the standard training procedure, the MAML algorithm trains the model with a double-loop structure. The inner loop iterates over meta-batches, each containing T tasks. For each task in the inner loop, a submodel is created and updated k times. After the k gradient steps have been taken for all T submodels, their weights are collected and processed by the Reptile meta-learner, which determines the next update to the meta-model.

At first glance, this training protocol appears similar to multi-task learning, since both expose the model to multiple tasks and thereby enable transfer learning. One major distinction, however, is that MAML makes use of the k-th gradients, which gives SGD access to higher-order terms in the loss function. This allows the MAML algorithm to find a better initialization than the other methods and to descend at a much more rapid rate on any downstream task, as shown in Figure 1. Furthermore, MAML can find a better model initialization than multi-task learning because it avoids overfitting to any single task, a known tendency of multi-task learning.

At the end of the study, we introduce a cost-to-improvement ratio to evaluate whether the additional accuracy gained by MAML justifies the increase in runtime. Although MAML delivers an absolute gain in accuracy, we express reservations about its comparative advantage, since this one-point increase in accuracy comes at a large sacrifice in runtime.
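The double-loop structure described above can be sketched as follows. This is a minimal toy illustration of the Reptile update rule, not our actual QA training code: the one-parameter linear-regression tasks, learning rates, and iteration counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(theta, w_true, n=32):
    # Gradient of MSE on a toy linear-regression task y = w_true * x.
    x = rng.normal(size=n)
    y = w_true * x
    return np.mean(2.0 * (theta * x - y) * x)

def inner_loop(theta, w_true, k=5, lr=0.02):
    # Inner loop: a submodel starts from the meta-parameters and
    # takes k SGD steps on its own task.
    phi = theta
    for _ in range(k):
        phi -= lr * task_loss_grad(phi, w_true)
    return phi

def reptile(theta, meta_iters=200, T=4, meta_lr=0.5):
    # Outer loop: sample a meta-batch of T tasks, adapt a submodel to
    # each, then move the meta-parameters toward the average of the
    # adapted weights -- the Reptile meta-update.
    for _ in range(meta_iters):
        tasks = rng.uniform(-2.0, 2.0, size=T)
        adapted = np.array([inner_loop(theta, w) for w in tasks])
        theta += meta_lr * (adapted.mean() - theta)
    return theta

theta = reptile(theta=0.0)
```

Because the task parameters here are drawn symmetrically around zero, the learned initialization settles near the "center" of the task distribution, from which each individual task can be reached in a few inner-loop steps.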
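The cost-to-improvement ratio mentioned above can be written as a one-line computation: extra runtime paid per point of accuracy gained over the baseline. The function name and the example numbers below are hypothetical placeholders, not our measured results.

```python
def cost_to_improvement(acc_gain_points, extra_runtime_hours):
    # Extra runtime (hours) per point of accuracy gained over baseline;
    # only meaningful when the gain is positive.
    if acc_gain_points <= 0:
        raise ValueError("ratio requires a positive accuracy gain")
    return extra_runtime_hours / acc_gain_points

# Hypothetical example: a 1-point gain costing 3 extra hours.
ratio = cost_to_improvement(1.0, 3.0)
```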