[Paper] Google’s 1.3 MiB On-Device Model brings high-performance Disfluency Detection down to size

Google Research proposes small, fast, on-device disfluency detection models based on the BERT architecture. The smallest model is only 1.3 MiB, a size reduction of two orders of magnitude and an inference latency reduction of a factor of eight compared to state-of-the-art BERT-based models.

Read More

Everything you need to know about Google BERT

If you’ve been following developments in deep learning and natural language processing (NLP) over the past few years, then you’ve probably heard of something called BERT; and if you haven’t, just know that techniques owing something to BERT will likely play an increasing part in all our digital lives. BERT is a state-of-the-art language representation model published by Google, and it represents a breakthrough in NLP, delivering excellent results on many tasks, including question answering, sentence classification, and more.

Read More

Clinical Natural Language Processing

Transfer Learning and Weak Supervision

Photo by Hush Naidoo on Unsplash
Every day across the country, doctors are seeing patients and carefully documenting their conditions, social determinants of health, medical histories and more into electronic health records (EHRs). These documentation-heavy workflows produce rich data stores with the potential to radically improve patient care. The bulk of this data is not in discrete fields, but rather in free-text clinical notes. Traditional healthcare analytics relies predominantly on discrete data fields, occasionally supplemented by regular expressions over free text, and so misses a wealth of clinical data.
Syndromic information about COVID-19 (e.g., fever, cough, shortness of breath) was valuable early in the pandemic to track spread before widespread testing was established. It continues to be valuable for understanding the progression of the disease and identifying patients likely to experience worse outcomes. Syndromic data is not captured robustly in discrete data fields. Clinical progress notes, especially in the outpatient setting, provide early evidence of COVID-19 infections, enabling forecasting of upcoming hospital surges. In this article, we’ll examine how NLP enables these insights through transfer learning and weak supervision.

Photo by Martin Sanchez on Unsplash
Natural language processing (NLP) can extract coded data from clinical text, making previously “dark” data available for analytics and modelling. With recent algorithm improvements and simplified tooling, NLP is more powerful and accessible than ever before; however, it’s not without logistical hurdles. Useful NLP engines require a great deal of labelled data to “learn” a data domain well. The specialized nature of clinical text precludes crowd-sourced labelling: it requires expertise, and the clinicians with that expertise are in high demand for much more pressing affairs, especially during a pandemic.
So how can health systems make use of their troves of free text data while respecting clinician time? A very practical approach is with transfer learning and weak supervision.
Modern NLP models no longer need to be trained from scratch. Many state-of-the-art language models are already pretrained on clinical text datasets. For COVID-19 syndromic data, we started with Bio_Discharge_Summary_BERT, available through the Hugging Face Transformers library (PyTorch). As described in the ClinicalBERT paper, the model is trained on the MIMIC-III dataset of discharge summaries. We used the transformer word embeddings from Bio_Discharge_Summary_BERT as a transfer learning base and fine-tuned a sequence tagging layer to classify entities with our specific symptom labels. For example, we were interested in “Shortness of Breath”; clinically, a lot of symptoms can be classified under this umbrella (e.g., “dyspnea”, “winded”, “tachypneic”). Our classification problem was limited to approximately 20 symptom labels, yielding higher performance than a generalized clinical NER problem.
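As a minimal sketch of that transfer learning base, the pretrained checkpoint can be loaded directly from the Hugging Face hub; this assumes the emilyalsentzer/Bio_Discharge_Summary_BERT model ID and is illustrative rather than our exact pipeline code:

import torch
from transformers import AutoTokenizer, AutoModel

# Load the ClinicalBERT discharge-summary checkpoint (assumed model ID)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_Discharge_Summary_BERT")

# Contextual word embeddings for a short clinical snippet
inputs = tokenizer("Patient reports dyspnea and feels winded.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape: (1, seq_len, 768)

These contextual, sub-word-aware embeddings are what the sequence tagging layer is fine-tuned on top of.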
To train this sequence tagging layer, however, we came back to the data problem. Both MIMIC-III and our internal clinical text datasets were unlabelled. The few publicly available labelled clinical text datasets (e.g., N2C2 2010) were labelled with a different use case in mind. How could we label enough data for our targeted use case, sampled responsibly to prevent bias in the model?
Our strategy had three steps: selective sampling for annotation, weak supervision, and responsible AI fairness techniques.
We used selective sampling to leverage our clinicians’ time more efficiently. For COVID-19 symptoms, that meant only serving up notes to annotators that were likely to have symptom information in them. A prenatal appointment note or a behavioral health note is very unlikely to discuss fever, cough, runny nose, or shortness of breath. Strategically limiting the note pool we sent to annotators increased the labels per annotation hour spent by our clinicians. For annotation we provided our clinicians with a tool called Prodigy. Its user interface was easy for them to use, and it is flexible enough to support different annotation strategies.
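A minimal sketch of such a prefilter is below; the note types, trigger terms, and column names are illustrative assumptions, not our production inclusion list:

import pandas as pd

# Hypothetical exclusion and trigger lists, for illustration only
EXCLUDED_NOTE_TYPES = {"prenatal", "behavioral health"}
TRIGGER_TERMS = ("fever", "cough", "runny nose", "shortness of breath")

def likely_symptomatic(row):
    # Skip note types unlikely to discuss respiratory symptoms
    if row["note_type"].lower() in EXCLUDED_NOTE_TYPES:
        return False
    # Keep notes containing at least one trigger term
    text = row["note_text"].lower()
    return any(term in text for term in TRIGGER_TERMS)

notes = pd.read_parquet("clinical_notes.parquet")  # hypothetical source
annotation_queue = notes[notes.apply(likely_symptomatic, axis=1)]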

One of the main decision points when setting up an annotation strategy is the granularity at which annotators will label. Choosing too coarse a label like “symptom” would not give us the data we need for our use case, but getting too specific, like “unproductive cough” versus “productive cough”, would place a heavy burden on annotators with no additional benefit for us. For any annotation strategy, it is important to balance the burden on annotators against the reusability of the labelled dataset. The less we have to go back to the well the better, but if it takes a clinician two hours to annotate a single clinical note, we have not succeeded either. For our project, the first pass of annotation was for NER only. We did a later pass for the assertion status of each entity (i.e., Present, Absent, Hypothetical). Prodigy allows for targeted strategies using custom recipe scripts.
After gathering the Prodigy annotations from our clinicians, we created rules-based labelling patterns to use in spaCy for weak supervision. Prodigy and spaCy are made by the same development group, making integration straightforward. Weak supervision is another annotation strategy; however, instead of “gold standard” annotation from clinical subject matter experts, it uses an algorithm to annotate a much larger volume of text. Ideally, the decreased accuracy from using an algorithm is offset by the large number of documents that can be processed. Using an algorithm based on the labelling patterns below, we were able to generate a very large training dataset.

{"label":"SOB","pattern":[{"LOWER":{"IN":["short","shortness"]}},{"LOWER":"of","OP":"?"},{"LOWER":"breath"}]}
{"label":"SOB","pattern":[{"LOWER":"tachypnea"}]}
{"label":"SOB","pattern":[{"LOWER":"doe"}]}
{"label":"SOB","pattern":[{"LOWER":"winded"}]}
{"label":"SOB","pattern":[{"LOWER":"breathless"}]}
{"label":"SOB","pattern":[{"LOWER":"desaturations"}]}
{"label":"SOB","pattern":[{"LOWER":"gasping"}]}
{"label":"SOB","pattern":[{"LOWER":"cannot"},{"LOWER":"get"},{"LOWER":"enough"},{"LOWER":"air"}]}
{"label":"SOB","pattern":[{"LOWER":"out"},{"LOWER":"of"},{"LOWER":"breath"}]}

Since our selective sampling biased which notes we surfaced to our annotators, we needed to safeguard against bias in the weakly supervised dataset that would ultimately train the model. Machine learning in the clinical domain requires a higher degree of diligence to prevent bias in models. Responsible AI techniques are becoming mandatory in all industries, but as equality and justice are fundamental tenets of biomedical ethics, we took care to develop an unbiased note sampling approach for weak supervision. For each dataset, clinical notes were sampled in equal numbers across race and ethnicity, geographic location, gender, and age. The labelling patterns were then applied to the notes through spaCy. The result was an annotated dataset in IOB format covering 100,000 clinical notes.
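A stratified sample along those dimensions can be drawn with a simple groupby, as in this sketch (the column names and per-stratum quota are hypothetical):

import pandas as pd

STRATA = ["race_ethnicity", "geography", "gender", "age_band"]  # assumed columns
PER_STRATUM = 500  # hypothetical quota per stratum

# Draw an equal-sized sample from each demographic stratum
sampled_notes = (
    notes.groupby(STRATA, group_keys=False)
         .apply(lambda g: g.sample(min(len(g), PER_STRATUM), random_state=42))
)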

import json
import pandas as pd
import spacy
from spacy.pipeline import EntityRuler
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def pandas_parse(x):
    # Load the labelling patterns (a single JSON file, or JSONL with one pattern per line)
    with open(patterns_file) as f:
        patterns = json.load(f) if patterns_file.lower().endswith("json") else [json.loads(s) for s in f]
    for p in patterns:
        p["id"] = json.dumps(p)
    spacy.util.set_data_path("/dbfs/FileStore/spacy/data")
    nlp = spacy.load(spacy_model, disable=["ner"])
    ruler = EntityRuler(nlp, patterns=patterns)
    nlp.add_pipe(ruler)
    return x.apply(lambda i: parse_text(i, nlp))

parse_pandas_udf = F.pandas_udf(pandas_parse, ArrayType(ArrayType(StringType())), F.PandasUDFType.SCALAR)

# Emit (token, IOB tag) pairs for each note
def parse_text(text, nlp):
    doc = nlp(text)
    tokens = []
    iob_tags = []
    for sent in doc.sents:
        # Keep sentences of manageable length that contain at least one matched entity
        if len(sent) < 210 and len(sent.ents) > 0:
            tokens += [t.text for t in sent]
            iob_tags += [t.ent_iob_ + "-" + t.ent_type_ if t.ent_iob_ != "O" else "O" for t in sent]
    return pd.DataFrame({"text": tokens, "iob_tags": iob_tags}).values.tolist()

At this point we were ready to train our sequence tagging layer. We used a framework called Flair to create a corpus from our IOB-labelled dataset. The corpus was then split into train, dev, and test sets, and Flair took it from there. The results were very promising.
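For context, training a Flair sequence tagger over those embeddings looks roughly like the sketch below; the file paths are hypothetical, and exact API names vary slightly across Flair versions:

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# IOB files with the token in column 0 and the tag in column 1
corpus = ColumnCorpus(
    "data/weak_labels",  # hypothetical folder
    {0: "text", 1: "ner"},
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)

tag_dictionary = corpus.make_label_dictionary(label_type="ner")
embeddings = TransformerWordEmbeddings("emilyalsentzer/Bio_Discharge_Summary_BERT")

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
)

ModelTrainer(tagger, corpus).train("models/covid_symptoms", max_epochs=10)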

F1-score (micro): 0.9964
F1-score (macro): 0.9783

By class:
ABDOMINAL_PAIN  tp: 977   fp: 6   fn: 5   precision: 0.9939  recall: 0.9949  f1-score: 0.9944
ANXIETY         tp: 1194  fp: 8   fn: 8   precision: 0.9933  recall: 0.9933  f1-score: 0.9933
CHILLS          tp: 343   fp: 1   fn: 0   precision: 0.9971  recall: 1.0000  f1-score: 0.9985
CONGESTION      tp: 1915  fp: 21  fn: 6   precision: 0.9892  recall: 0.9969  f1-score: 0.9930
COUGH           tp: 3293  fp: 6   fn: 6   precision: 0.9982  recall: 0.9982  f1-score: 0.9982
COVID_EXPOSURE  tp: 16    fp: 1   fn: 1   precision: 0.9412  recall: 0.9412  f1-score: 0.9412
DIARRHEA        tp: 1493  fp: 6   fn: 0   precision: 0.9960  recall: 1.0000  f1-score: 0.9980
FATIGUE         tp: 762   fp: 2   fn: 7   precision: 0.9974  recall: 0.9909  f1-score: 0.9941
FEVER           tp: 3859  fp: 7   fn: 2   precision: 0.9982  recall: 0.9995  f1-score: 0.9988
HEADACHE        tp: 1230  fp: 4   fn: 5   precision: 0.9968  recall: 0.9960  f1-score: 0.9964
MYALGIA         tp: 478   fp: 3   fn: 1   precision: 0.9938  recall: 0.9979  f1-score: 0.9958
NAUSEA_VOMIT    tp: 1925  fp: 7   fn: 12  precision: 0.9964  recall: 0.9938  f1-score: 0.9951
SOB             tp: 1959  fp: 10  fn: 10  precision: 0.9949  recall: 0.9949  f1-score: 0.9949
SWEATS          tp: 271   fp: 0   fn: 1   precision: 1.0000  recall: 0.9963  f1-score: 0.9982
TASTE_SMELL     tp: 8     fp: 0   fn: 6   precision: 1.0000  recall: 0.5714  f1-score: 0.7273
THROAT          tp: 1030  fp: 11  fn: 2   precision: 0.9894  recall: 0.9981  f1-score: 0.9937
WHEEZING        tp: 3137  fp: 6   fn: 0   precision: 0.9981  recall: 1.0000  f1-score: 0.9990

Given that we trained a transformer language model on a weakly supervised, rules-based dataset, one might reasonably ask: why not just use the rules-based approach in production? Because transformer language models (like BERT) use sub-word tokens and context-specific vectors, our trained model can identify symptoms not specified in the rules-based patterns file and can also correctly identify misspelled versions of our entities of interest (e.g., it correctly tags “cuogh” as [COUGH]).
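As a hypothetical illustration of that robustness, querying the trained tagger might look like this (the model path follows the training sketch above; actual predictions depend on the trained model, and the span API differs slightly between Flair versions):

from flair.data import Sentence
from flair.models import SequenceTagger

# Load the trained tagger from the (hypothetical) training output path
tagger = SequenceTagger.load("models/covid_symptoms/final-model.pt")

sentence = Sentence("Patient complains of a persistent cuogh and feels winded.")
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity.text, entity.tag)  # e.g., "cuogh COUGH", "winded SOB"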
Given the rich data available in free-text clinical notes and the logistical challenges of annotating them, transfer learning and weak supervision offer a very practical approach to healthcare NLP.

Read More