The failure to adopt fast, accurate speech-to-text technology means that millions of hours of audio are left undisclosed at trial every year, says Nigel Cannings, a lone voice in the wilderness on this topic.
Nineteen years ago, my father Bill Cannings, granddaddy of the European PC and eDiscovery industries, returned from LegalTech in Atlanta with a bee in his bonnet: “They still don’t get it!” He was talking about OCR, a technology he introduced in 1989 in the ground-breaking eDiscovery product R/KYV. And he is a man who should know about bleeding-edge technology, having introduced desktop computing to the UK market via his Byte Shop and Computerland chains in 1978.
By 1996, the UK market had well and truly accepted that OCR, by no means a perfect tool, was good enough to index hundreds of thousands of documents to a high enough standard to become useful, if not essential, to the eDiscovery process. There were all the usual arguments about the admissibility of images in the courtroom. However, when eDiscovery was first brought to trial at Southwark Crown Court in 1998, using an R/KYV system designed for the Inland Revenue, the Revenue said that the efficiencies of image-based eDiscovery saved them £30,000 for every day the trial ran. The amount they saved in first-pass case assessment using OCR technology was considerable. This may have reduced an income stream for lawyers reading documents, but the technology was here to stay.
In 2001 Bill’s team designed and implemented the ‘Courtroom of the Future’, which was the template for all future trials and is still used today.

Why is any of this relevant? Because when I came home from LegalTech this week, I was ranting to Bill in curiously similar style, but this time about the use of voice-to-text as part of the eDiscovery process. I have been a lone prophet in the wilderness for the last few years, espousing the use of fast and accurate speech-to-text technology to unlock the millions of hours of audio files left undisclosed in legal proceedings every year. Fortunately, one or two prescient souls saw the potential, and we now have a list of resellers and users that would shame many a large software firm, with new orders coming in every day, mainly from the UK, US and Europe. Take a look at Monday’s Raconteur to see what Epiq Systems has to say on the subject.
But there is still a huge groundswell of lawyers, software vendors and litigation support providers who just don’t get it. And if they are not careful, someone is going to eat their lunch very soon…

So, why is speech-to-text important, and why the reluctance to take it on? For years, “audio” in the eDiscovery process has meant only one thing: phonetic search, the extraction of phonetic content from an audio file and the building of a search index that can be queried with varying degrees of confidence. Being an audio-based technology, there is a large degree of “garbage in, garbage out”, so user experience will always vary. But leaving aside issues of cost and accuracy, the one thing that all users report is that, really, they would like to see the text.
Why? First, document reviewers are good at exactly that: the review of actual documents. Second, the text allows them to make full use of the capabilities of their existing eDiscovery systems, utilising concept searching and a single search-and-review UI. But two main arguments have been put up to counter the growth and availability of speech-to-text: that it is too slow, and that it is not accurate enough.

“It’s too slow”: a criticism that, until recently, was justified.
Speech systems can be thought of in two parts (although, see below, even that paradigm is shifting): acoustic and language. The acoustic stage is common to phonetic search systems and speech-to-text systems. The audio signal is analysed to extract “phonemes”, mathematical constructs intended to mimic the sounds that humans make. Based on research going back to the 1960s, these have a habit of being inaccurate, particularly in telephone or co-mingled speech.
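To make “phoneme” concrete, here is a minimal sketch in Python, assuming a toy pronunciation dictionary (the ARPAbet-style entries below are for illustration only, not drawn from any particular recogniser). The acoustic stage’s whole job is to recover streams like these from raw audio:

```python
# Phonemes are abstract units of sound. A pronunciation dictionary maps
# words onto phoneme sequences; symbols here are ARPAbet-style and the
# entries are illustrative.
PRONUNCIATIONS = {
    "disclose":  ["D", "IH", "S", "K", "L", "OW", "Z"],
    "discovery": ["D", "IH", "S", "K", "AH", "V", "ER", "IY"],
}

def to_phonemes(phrase):
    """Flatten a phrase into the phoneme stream the acoustic stage must recover."""
    return [p for word in phrase.split() for p in PRONUNCIATIONS[word]]

print(to_phonemes("disclose discovery"))
```

Notice that “disclose” and “discovery” share their first four phonemes; in noisy telephone audio, distinctions further down the stream are exactly what gets lost.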
Phonetic systems produce a “confusion matrix”: simply put, a lattice of the original phonemes together with alternative phonemes that might sound similar. This goes into a database and awaits a search term. The search term is deconstructed into its phonetic components, and those are matched against the database.

Speech-to-text, on the other hand, takes those phonemes at source and attempts to reconstruct the original sentence based on context, using a language model. This is a computationally intensive process, and it is the reason why speech-to-text has historically been limited to short runs. Now, however, using NVIDIA GPU technology you can process thousands of hours of data in the time it would traditionally have taken to process dozens, all on small hardware appliances. I’m proud to say that my company makes the world’s first commercially available speech-to-text system to run on NVIDIA GPUs, built by Boston Systems.
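The contrast between the two approaches can be sketched in a few lines of Python. Everything below — the pronunciation dictionary, the confidence scores, the one-word “language model” — is invented for illustration; real systems use vastly larger lattices and richer context. Phonetic search decomposes the query into phonemes and scores it directly against the lattice; speech-to-text combines the same acoustic scores with a language model to reconstruct the likeliest words:

```python
# Sketch of phonetic search vs language-model decoding over one phoneme
# stream. All entries and probabilities are illustrative.

PRONUNCIATIONS = {                 # toy pronunciation dictionary
    "bill": ["B", "IH", "L"],
    "pill": ["P", "IH", "L"],
}

# Confusion lattice: each time slice holds alternative phonemes with
# acoustic confidences. "B" vs "P" is a classic telephone-audio confusion.
lattice = [
    [("B", 0.55), ("P", 0.45)],
    [("IH", 0.70), ("EH", 0.30)],
    [("L", 0.90), ("R", 0.10)],
]

def phonetic_search(term, lattice):
    """Score how well the term's phonemes match the lattice alternatives."""
    phones = PRONUNCIATIONS[term]
    if len(phones) != len(lattice):
        return 0.0
    score = 1.0
    for phone, slice_ in zip(phones, lattice):
        score *= dict(slice_).get(phone, 0.0)   # 0 if phoneme absent
    return score

# Toy unigram "language model": prior probability of each word in the
# domain. Real decoders use n-gram or neural context over whole sentences.
LANGUAGE_MODEL = {"bill": 0.8, "pill": 0.2}

def decode(lattice):
    """Speech-to-text: combine acoustic score with the language model."""
    scored = {w: phonetic_search(w, lattice) * LANGUAGE_MODEL[w]
              for w in PRONUNCIATIONS}
    return max(scored, key=scored.get)

print(phonetic_search("pill", lattice))  # acoustic-only match score
print(decode(lattice))                   # 'bill' once context is applied
```

Here the acoustic evidence is genuinely ambiguous between “bill” and “pill”, and it is the language model that tips the decision — scaled up to real vocabularies and sentence-length context, that rescoring is precisely the computationally intensive step described above.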
“It’s not accurate enough”: in the last two years, the speech recognition world has changed significantly. GPU technology also allows you to harness the power of machine learning and neural networks. Rather than forcing a system to rely on a rigid linguistic structure, you give it the freedom to decide what has been said based on pure data and on what it has learned previously. Not only does this give you greater accuracy in difficult environments, such as noisy offices or where there is music, it also allows the system to “guess” phrases that were traditionally considered “out-of-vocabulary”.
If you ask Google what the next wave in speech technology is, they will point you to their neural networking and Deep Speech projects. In 2-3 years’ time, you will never hear the words “phoneme” or “phonetic” again.
So wake up, world! Your clients have millions of hours of discoverable audio, all of which can be processed quickly and cheaply. And if you don’t do it, someone else will.
Nigel Cannings is CTO of Intelligent Voice: http://www.intelligentvoice.com