This year at Denver Startup Week, I had a chance to speak about our use of deep learning in speech alongside Eric Harper of NVIDIA and Frank Burkholder from Galvanize. Frank led with the results of a project that uses convolutional neural networks to identify automobile information in photographs. It was a fascinating experiment, and the presentation of visual results from the network training and inference illustrated how networks learn, as well as some of their strengths and weaknesses. I followed Frank with a talk illustrating the problem of speech recognition, how it was addressed with older statistical models, and how deep learning fills the gaps left by these older methods.
As I’ve written before, this is a textbook disruption story. Speech recognition accuracy has improved dramatically over the last few years using off-the-shelf components (GPUs, deep learning libraries, etc.) to displace the expensive integrated solutions offered by the incumbents. Without these advancements in GPUs and deep learning, we would still be fitting models to speech as best we could and producing expensive, yet mediocre, results. A significant portion of my presentation is designed around this key point. When I finished, Eric seemed to pick up where I left off, even though we had not spoken before the presentation. He completed the story of the current GPU advancement from NVIDIA’s perspective and shared some performance numbers that completely blew me away.
In our lab, we have numerous servers equipped with NVIDIA GPUs. The fastest machines we have today use NVIDIA Titan X GPUs. Each of these cards delivers 11 TFLOPS. So, a machine with 8 of these would have 88 TFLOPS of theoretical computing capacity – faster than the fastest computer in the world in 2005. But Eric’s presentation included performance numbers from the new Tesla V100 GPUs – 120 teraflops (trillion floating point operations per second) each! The presentation included results from a system with 16 of these Volta GPUs connected via NVLink (NVIDIA’s high-speed GPU communication channel). That’s a theoretical 1.9 petaflops! Or, 1.9 quadrillion floating point operations per second.
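The arithmetic behind those headline numbers is just per-card peak throughput times card count. A minimal sketch, using the per-card figures quoted above (theoretical peaks, ignoring interconnect and memory-bandwidth overhead):

```python
# Theoretical peak throughput per card, in TFLOPS, as cited in this post:
# Titan X ~11 TFLOPS; Tesla V100 ~120 TFLOPS.
TITAN_X_TFLOPS = 11
V100_TFLOPS = 120

def aggregate_tflops(num_cards: int, tflops_per_card: float) -> float:
    """Theoretical aggregate throughput of a multi-GPU box (no overhead)."""
    return num_cards * tflops_per_card

titan_box = aggregate_tflops(8, TITAN_X_TFLOPS)   # 88 TFLOPS
volta_box = aggregate_tflops(16, V100_TFLOPS)     # 1920 TFLOPS ~= 1.9 PFLOPS

print(f"8x Titan X: {titan_box} TFLOPS")
print(f"16x V100:   {volta_box / 1000:.2f} PFLOPS")
```

Real sustained throughput will land below these peaks, but the back-of-the-envelope totals match the 88 TFLOPS and roughly 1.9 petaflops figures above.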
The current revolution in speech recognition was built on the backs of cards that produced ten or fewer teraflops each, in systems with just a few of these cards. With this recent leap, it becomes abundantly clear that speech can still get better. Quite a bit better. And it will improve from where we are today as GPUs and techniques are further refined. But this new class of GPUs will also drive solutions that we haven’t considered yet. Advances in biochemistry, healthcare, natural language processing, and other fields that still elude researchers may be just around the corner as we apply previously unimagined computing power and data availability.
We have been thrilled at the advances in speech recognition enabled by GPU-powered deep learning. It is at the core of SayIt. And, thanks to these advancements by companies like NVIDIA, our ability to make speech recognition better continues to grow as well. As outstanding as these advances are, I believe we all have much more to look forward to as these new tools and methods end up in the hands of bright minds around the world. We are making systems that hear you well, and everyone knows about driverless cars. Now we must ask: what comes next?
For more information about SayIt, and how it can help take you to the next level in speech technology, contact us at firstname.lastname@example.org.