Automatic speech recognition (ASR) and other natural speech and language processing techniques have become ubiquitous in the technology that surrounds us. From my cell phone to my dashcam to my nightstand, I always have some form of digital assistant nearby, one I can summon with the sound of my voice.
For general digital assistants, I’m not restricted in what I can say. Even if my instructions are not supported by the assistant, it seems as though any speech I produce will be transcribed accurately (I commanded my phone assistant to repair my hyperdrive motivator; though the transcript was correct, it feigned ignorance). If I ask for information, it uses the transcription as a web search query. For example, “Who shot Archduke Franz Ferdinand of Austria?” brings up the Wikipedia article regarding that fateful day in 1914. Asking “How do I change the alternator in a ’95 Ford Escort?” provides a list of YouTube videos. At times, it seems there is no limit to the extent of knowledge my incantations can conjure.
But alas! My fondness for Korean cuisine has become the undoing of this fantasy. When I request recipes for kimchi jjigae or dak-bokkeum-tang, the transcript shows that it is truly stumped. It is similarly unable to recognize technical terms from the world of speech recognition such as “Kneser-Ney smoothing”, “senones”, or “Mel-frequency cepstral coefficients”. You might be thinking, why on Earth would you expect a phone assistant to recognize those words? They’re either foreign language terms or technical jargon that most human speakers of English wouldn’t understand. And since a phone assistant is a general domain assistant, such a free pass is warranted in this instance.
General domain recognizers are the decathletes of ASR. They can do everything very well. They run; they jump; they can even shot put. You can ask questions from history, chemistry, auto repair, etc., and a good general domain recognizer will be able to transcribe most of it correctly, with errors occurring mainly around domain-specific names and terms. These models are made possible by big data and deep learning: recurrent and attention-based neural networks trained on thousands of hours of audio and billions of words of text.
There are a few issues with general domain recognizers, however. First of all, they’re expensive. If you don’t happen to be Google, Amazon, or Facebook, you are not likely to have access to unlimited amounts of real user data. Collecting enough data to be able to recognize speech from arbitrary domains is expensive, time-consuming, and difficult. And even then, you must come to terms with the fact that you will never have as much data as they do.
Another issue with general domain recognizers is that although they are good at all domains, they aren’t the best in any domain. Consider the world record holder for the decathlon, Kevin Mayer. He trains to be the best decathlete in each of 10 events. His personal records for these events are certainly amazing: 10.55 seconds for the 100 meter, 17.08 meters for the shot put, 7.80 meters for the long jump.
Despite these impressive records, Kevin Mayer would not want to race against Usain Bolt, who holds the 100m record at 9.58 seconds. Bolt’s record is almost a full second faster than Mayer’s. Why? It is because Bolt is a sprinter and that’s what he targets his training toward. Mayer would probably beat Bolt in long jump, shot put, or hurdles because Bolt doesn’t train for those things. Bolt knows the breadth of his domain and trains exclusively toward that aim.
The same holds for speech recognition models. If your product uses a voice interface to a culinary recipe catalog, it would not provide any benefit if the language model could recognize “Archduke Franz Ferdinand” or “Mel-frequency cepstral coefficients”. It would be a huge problem, however, if it were blind to the names of popular dishes such as bulgogi or beef bourguignon. If you know your target domain, you can focus the training of your speech recognition model on the data that matters most. In this way, you can quickly and affordably surpass the performance of general domain recognizers in areas where your users will notice.
Cobalt recently performed an experiment demonstrating the effectiveness of domain adaptation for automatic speech recognition of college-level chemistry lectures. Adaptation refers to taking a general base model and using targeted domain data to further train the model so it becomes an expert in the given domain. In ASR, the main types of adaptation are acoustic model adaptation and language model adaptation. In this experiment, we used only language model adaptation.
The language model (LM) is the part of an ASR model that learns how words are put together to form utterances (see our ASR overview for more information). Our general American English language model was trained on data from a variety of sources that include spontaneous, prepared, and newswire speech, in addition to written English such as that found in literature, blogs, and news articles. Undoubtedly, the type of language you hear from a chemistry professor is not the same as what you would hear from a news anchor or in a conversation between two strangers. For this reason, language model adaptation is needed to make our model an expert on how chemistry professors talk.
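To make the idea concrete, one common way to adapt a language model (a sketch of the general technique, not necessarily Cobalt’s exact recipe) is to linearly interpolate a small in-domain model with the large general model. The toy Python below builds unigram models from two tiny text samples and mixes them with a weight `LAMBDA`; all names and data here are illustrative.

```python
from collections import Counter

def unigram_probs(text):
    """Estimate unigram probabilities from whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy corpora standing in for the real general and in-domain training data.
general_text = "the news anchor said the weather will be mild today"
domain_text = "the valence electrons determine how the atom forms bonds"

general_lm = unigram_probs(general_text)
domain_lm = unigram_probs(domain_text)

LAMBDA = 0.3  # weight on the in-domain model; tuned on held-out data in practice

def interpolated_prob(word, lam=LAMBDA):
    """P(word) = lam * P_domain(word) + (1 - lam) * P_general(word)."""
    return lam * domain_lm.get(word, 0.0) + (1 - lam) * general_lm.get(word, 0.0)

for w in ["the", "valence", "weather"]:
    print(f"P({w}) = {interpolated_prob(w):.4f}")
```

In a real system the same idea applies to n-gram or neural LMs; the key point is that a modest amount of in-domain text can shift probability mass toward the words your users actually say.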
For the experiment, we identified a number of Ivy League professors who had their lectures posted online with manual transcripts. We chose one video from each professor as the test set. Then we collected transcripts of other chemistry lectures from the internet until we assembled an in-domain dataset of about a half million words. Although 500,000 words is a very small dataset for training a language model from scratch, it was enough to adapt our general LM toward the kind of speech used in chemistry lectures.
We ran the test set through our general model, our adapted model, and a 3rd party public cloud API. The results may be seen in Table 1 below. The metric used for measuring performance is word accuracy (WAcc). With domain adaptation, our word accuracy improved from 78.6% to 88.9%, reducing our number of errors by 48% relative. This represents a significant increase in accuracy with a rather small amount of additional training data. More importantly, most of the improvements are centered around terms which are pervasive in chemistry lectures, but may be rarely spoken in other contexts.
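For readers who want to check the arithmetic, word error rate is simply 100 minus word accuracy, and the relative error reduction is the drop in error rate divided by the original error rate:

```python
# Word accuracies reported in Table 1.
general_wacc = 78.6
adapted_wacc = 88.9

general_wer = 100 - general_wacc   # 21.4% of words wrong
adapted_wer = 100 - adapted_wacc   # 11.1% of words wrong

relative_reduction = (general_wer - adapted_wer) / general_wer
print(f"Relative error reduction: {relative_reduction:.0%}")  # ~48%
```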
Table 2 helps demonstrate why adaptation provides such improvements. Some words, such as “oxygens” or “stoichiometric,” did not exist in the general model at all. Of course, our general model contained the common word “oxygen”, but its plural, “oxygens”, had not been encountered in the general training data. Other words, although they existed in the general model, had such low probabilities that they were rarely recognized. This includes words like “valence”, which was often misrecognized as “balance” or “available” in the general model. Adaptation is able to ensure that the model contains all the relevant vocabulary with probabilities high enough to be accurately recognized when spoken.
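A quick diagnostic behind observations like these is an out-of-vocabulary check: compare the words in the domain transcripts against the recognizer’s vocabulary. The snippet below is a minimal sketch with toy data; the vocabulary set and transcript text are hypothetical stand-ins for the real model vocabulary and lecture transcripts.

```python
from collections import Counter

def oov_report(vocab, transcript_text, top_n=10):
    """Return the most frequent transcript words missing from the vocabulary."""
    counts = Counter(transcript_text.lower().split())
    oov = {w: c for w, c in counts.items() if w not in vocab}
    return Counter(oov).most_common(top_n)

# Toy stand-ins for the general model's vocabulary and the domain transcripts.
general_vocab = {"the", "two", "single", "oxygen", "here", "have", "an",
                 "odd", "number", "of", "electrons"}
chemistry_text = "the two single oxygens here have an odd number of valence electrons"

print(oov_report(general_vocab, chemistry_text))
# Words like "oxygens" and "valence" are flagged as out-of-vocabulary.
```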
As an added bonus, if you know who your target users are and have collected some of their speech, this data may also be used for adaptation. We used lectures from the target professors (excluding the test lectures) as training data for a third, speaker-adapted LM. When we used the in-domain data and speaker data to adapt the model, we saw an extra 5% relative reduction in errors over the domain adapted model, increasing our accuracy to 89.4%.
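If you are curious how the balance between general, domain, and speaker data might be chosen, a common recipe (again, a sketch of the general technique rather than Cobalt’s exact pipeline) is to pick interpolation weights that minimize perplexity on a held-out sample. The component models and held-out text below are tiny illustrative placeholders.

```python
import math
from itertools import product

def perplexity(weights, component_lms, heldout_tokens, floor=1e-9):
    """Perplexity of a linearly interpolated unigram mixture on held-out text."""
    log_prob = 0.0
    for token in heldout_tokens:
        p = sum(w * lm.get(token, 0.0) for w, lm in zip(weights, component_lms))
        log_prob += math.log(max(p, floor))  # floor guards against zero probability
    return math.exp(-log_prob / len(heldout_tokens))

def best_weights(component_lms, heldout_tokens, step=0.1):
    """Grid search over three mixture weights that sum to 1."""
    best = None
    grid = [round(i * step, 2) for i in range(int(round(1 / step)) + 1)]
    for w1, w2 in product(grid, repeat=2):
        w3 = round(1.0 - w1 - w2, 2)
        if w3 < 0:
            continue
        ppl = perplexity((w1, w2, w3), component_lms, heldout_tokens)
        if best is None or ppl < best[0]:
            best = (ppl, (w1, w2, w3))
    return best

# Tiny toy component models (general, domain, speaker) and a held-out sample.
general_lm = {"the": 0.5, "news": 0.25, "said": 0.25}
domain_lm = {"the": 0.4, "valence": 0.3, "electrons": 0.3}
speaker_lm = {"the": 0.3, "okay": 0.4, "so": 0.3}
heldout = "the valence electrons okay so the".split()

print(best_weights([general_lm, domain_lm, speaker_lm], heldout))
```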
Table 1: Results of testing various ASR models on chemistry lectures.
| Model | Word Accuracy (%) |
| --- | --- |
| Cobalt General Model | 78.6 |
| 3rd Party General Model | 81.4 |
| Cobalt Domain Adapted Model | 88.9 |
| Cobalt Speaker Adapted Model | 89.4 |
Table 2: Comparison of selected recognition results. The transcripts produced by the domain adapted model are correct. Errors made by the general model on the same speech are indicated in uppercase.
| General Model | Domain Adapted Model |
| --- | --- |
| when you have an odd number AVAILABLE electrons | when you have an odd number of valence electrons |
| molecule WITHIN ON PAIRED electron | molecule with an unpaired electron |
| <UNKNOWN> really like to be terminal | halogens really like to be terminal |
| it’s not possible for every ADAM to have | it’s not possible for every atom to have |
| the two single OXYGEN IS here | the two single oxygens here |
It is easy to see applications for these kinds of adaptations. As an example, Cobalt’s technology powers a medical dictation application with language models adapted for various specialties like radiology and reproductive health. Specializing allows the models to be highly accurate on technical vocabulary, and still small enough to run locally so that patient data is kept private, rather than sent to a cloud provider with a general model.
At Cobalt, we specialize in customizing ASR and other speech-related models to our customers’ use cases. This simple experiment just scratches the surface; by employing tools such as enhanced data collection, acoustic model adaptation, and language model rescoring, we can further enhance our models to fit your domain. Get in touch to see how we can help you.
About the author
Ryan Lish is a research scientist with expertise in natural speech and language processing. He studied linguistics at Brigham Young University and computational linguistics at the University of Washington. At Cobalt, he specializes in training acoustic and language models for ASR, in addition to his work developing Cobalt’s dialog management and natural language understanding (NLU) capabilities.