[Paper] Google’s 1.3 MiB On-Device Model brings high-performance Disfluency Detection down to size

A research team from Google Research proposes small, fast, on-device disfluency detection models based on the BERT architecture. The smallest model size is only 1.3 MiB, representing a size reduction of two orders of magnitude and an inference latency reduction of a factor of eight compared to state-of-the-art BERT-based models.

In a new paper, a Google Research team proposes a set of small, fast and accurate disfluency detection models that are suitable for on-device use.

Disfluency detection is the task of identifying self-repairs, repetitions, restarts and filler words in spoken language. Current disfluency detection models have proven their worth, showing high accuracy on English text for automatic speech recognition (ASR) tasks.
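To make the task concrete, here is a minimal Python sketch that assumes the common framing of disfluency detection as per-token tagging (0 for fluent, 1 for disfluent). The utterance, the hand-written labels and the helper name `remove_disfluencies` are illustrative assumptions, not taken from the paper.

```python
from typing import List


def remove_disfluencies(tokens: List[str], labels: List[int]) -> List[str]:
    """Keep tokens tagged fluent (0) and drop tokens tagged disfluent (1)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == 0]


# Hypothetical utterance containing a self-repair ("to boston") and a filler ("uh").
tokens = "i want a flight to boston uh to denver".split()
# Hand-written tags for illustration only; a real system would predict these.
labels = [0, 0, 0, 0, 1, 1, 1, 0, 0]

fluent = remove_disfluencies(tokens, labels)
print(" ".join(fluent))
```

In a full pipeline, a detection model would produce the labels and a step like this would clean the ASR transcript before downstream processing.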

Compact ASR apps are increasingly moving from server-side to on-device inference, with benefits that include lower latency, privacy preservation and the ability to operate offline. Most research on disfluency detection, meanwhile, has focused on accuracy. Shrinking these models and cutting their inference time so that they can run on-device while remaining accurate is a challenge that has stayed relatively unexplored.

The Google Research team’s disfluency detection models are based on the BERT architecture and are as small as 1.3 MiB. This represents a size reduction of two orders of magnitude, along with an eightfold reduction in inference latency, compared to state-of-the-art BERT-based models.


The researchers summarize their contributions as:

  1. Explore the capabilities of small on-device disfluency detection models for the first time, showing that it is possible to achieve only a slight degradation in performance on a disfluency detection task with a model as small as 1.3 MiB.
  2. Demonstrate the importance of pretraining and the effect of domain mismatch between conversational and written text on model performance.
  3. Find that self-training has a more pronounced effect on these smaller models than on conventional BERT models, while pretraining on Reddit improved the performance of a large BERT-based model.

Recently proposed BERT distillation approaches have enabled significant size and latency reductions. Most of these models follow a student-teacher paradigm: BERT is trained as a teacher, while a compact student model learns from the teacher via various distillation approaches. This method aims to find a replacement (student) model with a much smaller size than the BERT (teacher) model. The Google team’s approach is based on such previous pretrained distilled BERT models, including DistilBERT, MobileBERT, TinyBERT, PD-BERT and Small-vocab BERT, which are then fine-tuned on disfluency detection tasks.
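The student-teacher idea can be illustrated with the generic soft-target distillation loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. This is a sketch of that general technique only. The distilled models named above each use their own, more elaborate objectives, and the function names, logits and temperature below are assumptions chosen for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2


# Hypothetical per-token logits for the classes (fluent, disfluent).
teacher = [0.2, 2.5]
close_student = [0.1, 2.0]
far_student = [2.0, 0.1]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A student whose predictions are closer to the teacher's receives a smaller loss, which is the signal that lets a compact model imitate a much larger one.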

Most of these BERT models were trained on text from Wikipedia and books, which exposes them to a wide range of topics. The linguistic styles of these corpora, however, are unlikely to match the more conversational and disorganized style of spontaneous speech and dialogue. The team therefore experimented with self-training and pretraining to mitigate the domain shift caused by mismatched styles.

The team conducted experiments to evaluate the performance of their proposed models. They trained the models on the Switchboard corpus of English conversational dialogue and used Reddit comments for pretraining and the Fisher dataset for both pretraining and self-training.
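As a rough illustration of self-training in general, the sketch below trains a model on labeled data, uses it to pseudo-label unlabeled utterances, then retrains on the union. The paper's exact recipe may differ, for example in which model produces the pseudo-labels. The names `self_train` and `train_filler_tagger` and the toy data are hypothetical, and the toy tagger merely memorizes disfluent tokens, so it shows the data flow rather than any accuracy gain.

```python
from typing import Callable, List, Sequence, Tuple

Tagger = Callable[[List[str]], List[int]]
Example = Tuple[List[str], List[int]]  # (tokens, per-token labels)


def self_train(labeled: Sequence[Example],
               unlabeled: Sequence[List[str]],
               train_fn: Callable[[List[Example]], Tagger],
               rounds: int = 1) -> Tagger:
    """Train, pseudo-label the unlabeled data, then retrain on the union."""
    model = train_fn(list(labeled))
    for _ in range(rounds):
        pseudo = [(tokens, model(tokens)) for tokens in unlabeled]
        model = train_fn(list(labeled) + pseudo)
    return model


def train_filler_tagger(examples: List[Example]) -> Tagger:
    """Toy stand-in for model training: memorize tokens ever tagged disfluent."""
    disfluent = {tok for tokens, labels in examples
                 for tok, lab in zip(tokens, labels) if lab == 1}
    return lambda tokens: [1 if tok in disfluent else 0 for tok in tokens]


# Hypothetical data standing in for a labeled corpus and an unlabeled one.
labeled = [(["uh", "i", "see"], [1, 0, 0])]
unlabeled = [["uh", "yes"], ["i", "um", "agree"]]

model = self_train(labeled, unlabeled, train_filler_tagger)
print(model(["uh", "okay"]))
```

With a real neural tagger in place of `train_filler_tagger`, the pseudo-labeled in-domain data is what exposes the model to conversational style beyond the labeled set.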


The researchers compared the model performance of BERT and its smaller variants. Small-vocab BERT demonstrated the best performance among the proposed on-device capable models, validating the importance of self-training for improving model performance as model size decreases.


The researchers also performed quantization on MobileBERT, resulting in a further size reduction of their smallest model from 4.7 MiB to 1.3 MiB, with only a small F1 score degradation of roughly 0.1 to 0.2 points.
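The reported shrink from 4.7 MiB to 1.3 MiB is a factor of about 3.6, which is in the range expected if 32-bit float weights are stored as 8-bit integers (at most a 4x saving). The article does not state the quantization scheme, so the sketch below assumes simple symmetric per-tensor int8 quantization. The function names and weight values are illustrative.

```python
from array import array
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[array, float]:
    """Map floats to int8 values in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (max(-127, min(127, round(w / scale)))
                            for w in weights))
    return quantized, scale


def dequantize(quantized: array, scale: float) -> List[float]:
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in quantized]


# Hypothetical weights stored as 32-bit floats.
weights = array("f", [0.5, -1.0, 0.25, 0.0])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

float_bytes = weights.itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
print(float_bytes, "bytes ->", int8_bytes, "bytes")
print("reported ratio:", round(4.7 / 1.3, 2))
```

Each restored weight differs from the original by at most half a quantization step, which is consistent with only a small loss in accuracy after quantization.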


Comparisons between the team’s best-performing models and other popular models validated that the proposed small-size models can retain high performance.

Overall, the study takes a significant step toward building small, high-performance, on-device disfluency detection models and presents a number of effective domain adaptation and data augmentation strategies that can be used to improve the performance of small models.

The paper Disfluency Detection with Unlabeled Data and Small BERT Models is on arXiv [paper].

Author: Hecate He
