The Intelligent Voice team took part in this year's NVidia GPU Technology Conference (GTC 2016), held in San Jose, California, from April 5 to 7. Each year NVidia announce new products and technologies at GTC. This year the focus was on self-driving cars, virtual reality and deep learning. Coinciding with GTC, Intelligent Voice released their patent-pending SmartTranscript™. The SmartTranscript™ uses the new HTML5 standard and is essentially a wrapper for audio and video files.
As well as the standard play, pause and drag-bar navigation tools, the SmartTranscript™, powered by Intelligent Voice’s JumpTo™ technology, contains a searchable, automatically generated full transcription of the speech content. It also contains an automatically generated list of suggested topics of interest, which can be used to navigate the file and to get a quick sense of what the file as a whole contains. The SmartTranscript™ is a stand-alone file, so it can be emailed, indexed and stored easily on your file system.
For each of the last eight years GTC has featured an Emerging Companies Summit (ECS). ECS is a great way for companies to put their technology in the spotlight in order to find potential partners and attract investors. The event has a strong track record of helping promising companies win worldwide recognition. Previous competitors include Oculus Rift (acquired by Facebook for $2 billion), Gaikai (acquired by Sony for $380 million) and Natural Motion (acquired by Zynga for $527 million). The top prize this year was $100,000. From an initial entry of over 90 companies, Intelligent Voice were shortlisted to the top 12 and invited to pitch at the event. Intelligent Voice would like to congratulate the winner, Sadako from Barcelona, who are developing a robotic solution for plastic bottle recycling. Intelligent Voice were then given a separate award for innovation for their new SmartTranscript™, winning almost $100,000 in prizes.
Intelligent Voice are pioneers in the use of GPUs not only for training but also for decoding, allowing ultra-high-speed, high-volume speech-to-text. This was only made possible by a UK Government SMART grant: hats off to InnovateUK!
Intelligent Voice also unveiled some exciting new speech research. Traditionally, speech recognition is a complicated pipeline that combines different algorithms for feature extraction, dimensionality reduction, sequence modelling and optimisation. Intelligent Voice have managed to simplify the process with deep learning. In an invited talk at GTC, CTO Nigel Cannings outlined a ‘crazy idea he had on a Saturday morning’: to get a deep Convolutional Neural Network (CNN) to perform speech recognition. Working on the NVidia DIGITS platform, Intelligent Voice used the image classification capabilities of the CNN to classify spectrogram images into classes of phones. Using the TIMIT benchmark speech corpus, 1.4 million spectrogram images in the 61 phone classes were used to train the network, achieving state-of-the-art performance.
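To make the idea of a "spectrogram image" concrete, here is a minimal sketch of how a waveform can be turned into one with NumPy. It is an illustration only, not Intelligent Voice's actual pipeline: the function names, the 25 ms window, the 10 ms hop and the 8-bit greyscale scaling are all assumptions made for this example. The 16 kHz default matches the sample rate of the TIMIT recordings.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, win_ms=25.0, hop_ms=10.0):
    """Return a log-power spectrogram of shape (freq_bins, n_frames).

    The window and hop lengths are assumed values, not taken from the talk.
    """
    win = int(sample_rate * win_ms / 1000)   # samples per analysis window
    hop = int(sample_rate * hop_ms / 1000)   # samples between window starts
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame, then decibels.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10).T

def to_image(spec):
    """Scale a spectrogram to 8-bit greyscale so it can be saved as an image."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo + 1e-12)
    return np.round(scaled * 255).astype(np.uint8)

# Example: one second of a 440 Hz tone standing in for real speech.
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
spec = log_spectrogram(tone)
image = to_image(spec)
```

Each column of `image` is one moment in time and each row is one frequency band, which is what lets an image classifier be pointed at speech.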
The animation shows, at the top, an utterance selected at random from the TIMIT speech corpus, together with its ground-truth transcription. Below that are the phones of the utterance, as transcribed by phoneticians. With the trained network, inference is performed by sliding the spectrogram image through time, and the resulting classification is shown at the bottom of the animation.
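The sliding step can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the code behind the animation: `sliding_windows`, `classify_over_time` and `collapse_repeats` are hypothetical helpers, the window width and step are arbitrary, and `toy_classifier` is a stand-in for the trained CNN, which is not reproduced here. Collapsing repeated labels into a sequence is likewise an assumption about how per-window output might be tidied up.

```python
import numpy as np

def sliding_windows(spec, width, step=1):
    """Yield (start_frame, patch) pairs as a window moves along the time axis."""
    n_frames = spec.shape[1]
    for start in range(0, n_frames - width + 1, step):
        yield start, spec[:, start:start + width]

def classify_over_time(spec, classify_patch, width, step=1):
    """Apply a patch classifier at every window position."""
    return [(start, classify_patch(patch))
            for start, patch in sliding_windows(spec, width, step)]

def collapse_repeats(labels):
    """Merge runs of identical consecutive labels into a single label."""
    merged = []
    for label in labels:
        if not merged or merged[-1] != label:
            merged.append(label)
    return merged

def toy_classifier(patch):
    """Hypothetical stand-in for the trained network: thresholds mean energy."""
    return "speech" if patch.mean() > 0.25 else "sil"

# Toy spectrogram: 4 frequency bins, 10 frames, energy only in the second half.
spec = np.zeros((4, 10))
spec[:, 5:] = 1.0

results = classify_over_time(spec, toy_classifier, width=2, step=2)
phones = collapse_repeats([label for _, label in results])
print(phones)  # ['sil', 'speech']
```

With a real network, `toy_classifier` would be replaced by a call that returns one of the 61 phone classes for each patch.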