There is a sentiment amongst some observers that speech recognition technology has not advanced much. This is absolutely not the case.
Speech recognition has been around since there have been microprocessors. For the first few decades it was dominated by approaches borrowed from probability and statistics. In a nutshell, a collection of text and audio snippets was analyzed as follows: the text snippets were turned into sequences of phonemes (units of pronunciation, roughly 40 for US English), and the audio segments were sampled every 10 ms or so to produce a feature vector describing how much energy there was in each of ‘N’ frequency bands (typical values for N ranged from 13 up to 40 or so). The next part of the analysis was to crunch the numbers to find, for each phoneme, the statistical distribution of feature vector values. In other words, for phoneme ‘d’, you could work out that the energy in frequency band 5 had some mean value with some standard deviation, and similarly for the remaining frequency bands. Several decades were spent analyzing increasingly large data sets and bringing ever larger statistical hammers to bear. The improvements were incremental and hard won. For people who have been around speech recognition from the outset, the progress may have seemed glacial.
This statistical approach gave rise to terms such as “Gaussian Mixture Models,” “Hidden Markov Models,” “Feature Vectors,” and a slew of other wonderfully opaque terms.
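To make the statistical recipe concrete, here is a minimal sketch in Python of the two steps described above: turning audio into per-frame feature vectors, and then computing a mean and standard deviation per frequency band for each phoneme. The function names are hypothetical, and several simplifications are assumed: frames do not overlap, the bands are equal-width rather than perceptually spaced, and a naive DFT stands in for an FFT. Production systems differ on all three points.

```python
import cmath
import math
import statistics

def frame_signal(samples, sample_rate, frame_ms=10):
    """Split audio into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def band_energies(frame, n_bands):
    """Return one feature vector: the energy in each of n_bands equal-width bands."""
    n = len(frame)
    half = n // 2  # only the first half of the spectrum is informative for real audio
    power = []
    for k in range(half):
        # Naive DFT coefficient for frequency bin k (an FFT would be used in practice).
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    width = half // n_bands  # assumes n_bands <= half
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]

def phoneme_statistics(labelled_vectors):
    """Map each phoneme to a list of (mean, standard deviation) pairs, one per band."""
    by_phoneme = {}
    for phoneme, vector in labelled_vectors:
        by_phoneme.setdefault(phoneme, []).append(vector)
    return {
        phoneme: [(statistics.mean(band), statistics.pstdev(band))
                  for band in zip(*vectors)]
        for phoneme, vectors in by_phoneme.items()
    }
```

Given frames that have already been labelled with phonemes, `phoneme_statistics` yields exactly the kind of statement made above: for phoneme ‘d’, band 5 has some mean energy with some standard deviation. A Gaussian Mixture Model generalizes this by fitting several such distributions per phoneme instead of one.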
As good as those models were, the last few years have marked the beginning of a transition away from the statistical approach. Researchers found they could train a neural net using the same feature vectors that they used for statistical analysis and get a net to imitate the best of the statistical models. This has given rise to the term “Hybrid Models,” since they are a hybrid of the statistical approach and neural nets. It was a pleasant discovery that, though these nets had only been trained to imitate statistical models, they actually outperformed them. The new “state of the art” moved from pure statistical models to the hybrid model.
However, the hybrid approach raised an obvious question: Could a neural net learn starting from just audio and text? In 2014 and 2016, Baidu Research published two papers (“Deep Speech” and “Deep Speech 2”) where they trained nets from scratch to recognize speech and output spellings (not phonemes) for both English and Chinese. They demonstrated that statistical models were not necessary and that you could skip phonemes and go straight to characters or words. Baidu’s approach also addressed one of the ugly facts about speech recognition: the requirement for meticulously curated training sets, where audio samples have been transcribed and marked up to include special tokens for non-speech audio (coughs, “ers”, “ums”, and extraneous noises). Baidu’s nets figured out what was speech and what wasn’t without any external guidance.
On an entirely separate track from speech recognition, researchers, both academic and corporate, have been devising new types of neural nets to analyze images, for purposes such as recognizing house numbers in street images, or recognizing diverse objects such as cars, streets, pedestrians, buildings, and so forth. As it turns out, the components of neural nets which are used for image recognition also work well for speech recognition. This is not surprising if you have ever seen a spectrogram, which is a pictorial representation of speech: a “heat map” showing how the energy (heat) at different frequencies changes over time, in which you can see the shape of the individual phonemes. The “convolutional layer” in image processing neural nets can learn to spot phonemes in spectrograms just as well as house numbers in street images.
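As a rough illustration of what a convolutional layer does with a spectrogram, the sketch below slides a small kernel over a toy “heat map” and records how strongly each position matches. Everything here is an assumption made for illustration: the spectrogram is a hand-made grid (rows are frequency bands, columns are time frames), and the kernel is set by hand to respond to energy sweeping upward in frequency, whereas a real net learns its kernels from data. As in most neural net toolkits, the “convolution” is computed without flipping the kernel.

```python
def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding) and return the grid of match scores."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def strongest_response(response):
    """Return (score, row, column) of the best-matching kernel position."""
    return max((value, r, c)
               for r, row in enumerate(response)
               for c, value in enumerate(row))

# Toy spectrogram: 1 marks high energy. The diagonal is a made-up rising sweep.
toy_spectrogram = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hand-set kernel that matches the same diagonal shape.
rising_sweep_kernel = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

response = convolve2d_valid(toy_spectrogram, rising_sweep_kernel)
score, row, col = strongest_response(response)
```

The response is largest where the kernel lines up with the sweep, which is the sense in which a convolutional layer “spots” a shape regardless of where in the image, or when in the audio, it occurs.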
One subtlety in training a neural net to recognize speech from text and audio samples is that you are withholding a key piece of information from the net: when do the individual phonemes (or letters) begin and end in the audio? As humans we are very good at listening to audio and knowing exactly when phonemes or words begin and end. We are so good at it that we just “assume” it is obvious. But the naïve neural net has no idea. With hybrid nets we can solve the problem in advance for the net: the statistical analysis can determine where the phonemes begin and end and supply the “answers”. The Baidu researchers leveraged an algorithm devised by Alex Graves in 2006, which goes by the ungainly name “Connectionist Temporal Classification” (or more commonly “CTC”), which can (simplistically) iterate through all the ways a given sequence of phonemes (or letters or characters) can be laid out, including gaps between them (for silence, noises, and other artifacts). This algorithm has allowed pure neural net speech recognition systems to become competitive with (and even exceed) the hybrid models discussed above.
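The idea of “iterating through all the ways a sequence can be laid out” can be sketched in a few lines of Python. In the sketch below, `_` is the gap (blank) symbol, and a frame-by-frame path counts as spelling a target if merging repeated symbols and then dropping blanks yields that target. The first function sums over every path by brute force; the second computes the same total with the dynamic-programming recurrence that makes CTC practical. The function names and the per-frame probabilities are made up for illustration, and this covers only the forward probability, not the log-space arithmetic or gradients a real training loop needs.

```python
import itertools

BLANK = "_"

def collapse(path):
    """Merge repeated symbols, then drop blanks: 'cc_aa_t' -> 'cat'."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_probability_brute_force(frame_probs, target):
    """Sum the probability of every frame-by-frame path that collapses to target."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total

def ctc_probability_forward(frame_probs, target):
    """Same total, computed by dynamic programming over the blank-padded target."""
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]  # e.g. 'ab' -> _ a _ b _
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][BLANK]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, len(frame_probs)):
        new = [0.0] * len(ext)
        for s, symbol in enumerate(ext):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and symbol != BLANK and symbol != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between two different symbols
            new[s] = total * frame_probs[t][symbol]
        alpha = new
    # A valid path ends on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)

# Made-up per-frame output probabilities over the symbols {_, a, b}.
example_frames = [
    {"_": 0.2, "a": 0.7, "b": 0.1},
    {"_": 0.3, "a": 0.5, "b": 0.2},
    {"_": 0.4, "a": 0.1, "b": 0.5},
    {"_": 0.6, "a": 0.1, "b": 0.3},
]

p_slow = ctc_probability_brute_force(example_frames, "ab")
p_fast = ctc_probability_forward(example_frames, "ab")
```

The two functions agree, but the brute-force version grows exponentially with the number of frames, while the forward version is linear in it. Training pushes the net to raise this total probability for the correct transcript, without anyone ever telling it where each letter begins or ends.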
If you browse around the web looking at demonstrations of what neural nets can do, you will soon get the feeling that you are playing “Buzzword Bingo”. Here are a few of the terms you will run across, each with a brief description. Even though the demonstrations you find may be for something other than speech recognition, all these components can be used in speech recognition and its converse, speech synthesis.
Convolutional Nets:
These are the workhorses of image recognition. They can be used to recognize handwriting, find where a dog is in an image, classify content, and so forth.
Recurrent Nets:
These are the workhorses of anything which requires recognizing sequences and inferring what comes next. In linguistics you can feed lots of word sequences into a recurrent net and it will get pretty good at predicting likely continuations of a given word sequence. This is behind some of the demonstrations you may have seen of getting a net to write a news article. A trivial example: feed a recurrent net segments snipped out of a “sine wave,” and even though the net has never seen the full sine wave, if you give it a starting point (or starting segment) it will draw a pretty good sine wave.
Generative and Adversarial Nets:
This is where you pit two neural nets against each other in order to train them. One net is given the task of synthesizing an image or sound, and the other net is given a collection of both real and fake images or sounds and must identify which is real and which is fake. The synthesizer is forced to produce more and more convincing images or sound bites, and the discriminator gets better at telling real from fake. These can be used to embed pictures of celebrities in scenes (which they were never in) or make them say things they never said.
Embedding or Clustering Nets:
This is where you train a net to spot relationships between words (or members of collections). In the case of words, the net learns to turn words into vectors, such that two vectors being “close” to each other represents the closeness of some relationship. You can even perform a kind of math on the vectors: “King” + (“Woman” – “Man”) ≈ “Queen.” These nets can be used for linguistic analysis by modeling relationships between concepts.
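The vector arithmetic can be tried out with a toy example. The four-dimensional vectors below are invented by hand purely for illustration (loosely: royalty, male, female, fruit); a trained embedding net would learn vectors with hundreds of dimensions whose meanings are not labelled. The helper names are likewise hypothetical.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Closeness of two vectors: 1.0 means they point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, start, minus, plus):
    """Return the word closest to start - minus + plus, ignoring the input words."""
    query = [s - m + p for s, m, p in
             zip(vectors[start], vectors[minus], vectors[plus])]
    candidates = [w for w in vectors if w not in (start, minus, plus)]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

result = analogy(toy_vectors, "king", "man", "woman")
```

With these toy vectors, `result` is `"queen"`: subtracting “man” and adding “woman” moves the point for “king” to the neighborhood of “queen”, and cosine similarity picks out the nearest remaining word.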
Denoising Nets, AutoEncoders, Attractor Nets:
These are various nets which can be used for tasks such as reducing noise or separating speakers when there is cross-talk in the audio. These are attractive components to put in a processing pipeline to bump the accuracy of an acoustic model which wasn’t trained on noisy audio or audio with multiple speakers.
The take-home point is that the bulk of the research being published today uses these and other components, built with neural net toolkits that run on GPUs. Together, these have enabled the move away from statistical models, with significant gains in both accuracy and speed.