Annotating Sparse Risk Factors in Clinical Records with BERT
Though patient clinical records contain an abundance of medical information, they typically take the form of fragmented free text, making the extraction of relevant pieces costly. In this project, we revisit the 2014 i2b2 challenge on identifying risk factors for heart disease in clinical records, focusing on annotating smoking status and family history of cardiovascular disease, two of the most difficult risk factors in the challenge due to the sparsity of their less common classes. The teams participating in the 2014 challenge applied combinations of hand-written rules and classifiers such as SVMs; the objective of this paper is to adapt more recently developed transformer models to this task, both to evaluate their suitability and to understand whether they can be trained as a substitute for the more explicit reasoning of rule-based systems. Fine-tuning BERT, as well as Clinical BERT and BlueBERT -- two BERT-initialized models further pre-trained for the clinical and biomedical domains -- we find that Clinical BERT and BlueBERT achieve slightly higher F1 scores than BERT, but within the margin of error. Moreover, we find that basic oversampling and class-weighting approaches to addressing the class imbalance do not improve the overall performance of the BERT models on this task, as the trade-off weakens the models' performance on the more common classes. The extraction of the span of text within a clinical record most relevant to the risk factor, and the length of the extracted span, do significantly affect performance, however; for the smoking risk factor, with simple heuristics for extracting the relevant part of a record, BERT models achieve performance comparable to many of the highest-scoring systems from the 2014 challenge.
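To make the setup concrete, the following is a minimal sketch (not the authors' code) of fine-tuning a BERT model for smoking-status classification with class-weighted cross-entropy, one of the imbalance strategies described above. The model name, label set, class counts, and example snippet are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: class-weighted fine-tuning of BERT for smoking status.
# Hypothetical setup; not the authors' implementation.
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed label set for the smoking-status risk factor.
LABELS = ["CURRENT", "PAST", "EVER", "NEVER", "UNKNOWN"]

model_name = "bert-base-uncased"  # could also be a clinical/biomedical variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(LABELS)
)

# Inverse-frequency class weights to counter the sparsity of rare classes;
# these counts are placeholders, not the real i2b2 class distribution.
counts = torch.tensor([120.0, 45.0, 10.0, 200.0, 625.0])
weights = counts.sum() / (len(counts) * counts)
loss_fn = CrossEntropyLoss(weight=weights)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    """One gradient step on a batch of record snippets and gold labels."""
    batch = tokenizer(
        texts, truncation=True, padding=True, max_length=256, return_tensors="pt"
    )
    logits = model(**batch).logits
    loss = loss_fn(logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage with an invented snippet; real inputs would be the spans
# extracted from each clinical record by the heuristics the paper studies.
train_step(["Social history: quit smoking 10 years ago."], [LABELS.index("PAST")])
```

The same skeleton covers the oversampling variant by duplicating rare-class examples in the training batches instead of reweighting the loss; the span-extraction heuristics only change what text is passed in as `texts`.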