In computer technology, Speech Recognition refers to the recognition of human speech by computers for the performance of speaker-initiated computer-generated functions (e.g., transcribing speech to text; data entry; operating electronic and mechanical devices; automated processing of telephone calls) — a main element of so-called natural language processing through computer speech technology.
The Challenge of Speech Recognition
Writing systems are ancient, going back as far as the Sumerians of 6,000 years ago. The phonograph, which allowed the analog recording and playback of speech, dates to 1877. Speech recognition had to await the development of computer, however, due to multifarious problems with the recognition of speech.
First, speech is not simply spoken text--in the same way that Miles Davis playing So What can hardly be captured by a note-for-note rendition as sheet music. What humans understand as discrete words, phrases or sentences with clear boundaries are actually delivered as a continuous stream of sounds: Iwenttothestoreyesterday, rather than I went to the store yesterday. Words can also blend, with Whaddayawa? representing What do you want? Second, there is no one-to-one correlation between the sounds and letters. In English, there are slightly more than five vowel letters--a, e, i, o, u, and sometimes y and w. There are more than twenty different vowel sounds, though, and the exact count can vary depending on the accent of the speaker. The reverse problem also occurs, where more than one letter can represent a given sound. The letter c can have the same sound as the letter k, as in cake, or as the letter s, as in citrus.
History of Speech Recognition
Despite the manifold difficulties, speech recognition has been attempted for almost as long as there have been digital computers. As early as 1952, researchers at Bell Labs had developed an Automatic Digit Recognizer, or "Audrey". Audrey attained an accuracy of 97 to 99 percent if the speaker was male, and if the speaker paused 350 milliseconds between words, and if the speaker limited his vocabulary to the digits from one to nine, plus "oh", and if the machine could be adjusted to the speaker's speech profile. Results dipped as low as 60 percent if the recognizer was not adjusted.
Audrey worked by recognizing phonemes, or individual sounds that were considered distinct from each other. The phonemes were correlated to reference models of phonemes that were generated by training the recognizer. Over the next two decades, researchers spent large amounts of time and money trying to improve upon this concept, with little success. Computer hardware improved by leaps and bounds, speech synthesis improved steadily, and Noam Chomsky's idea of generative grammar suggested that language could be analyzed programmatically. None of this, however, seemed to improve the state of the art in speech recognition.
In 1969, John R. Pierce wrote a forthright letter to the Journal of the Acoustical Society of America, where much of the research on speech recognition was published. Pierce was one of the pioneers in satellite communications, and an executive vice president at Bell Labs, which was a leader in speech recognition research. Pierce said everyone involved was wasting time and money.
It would be too simple to say that work in speech recognition is carried out simply because one can get money for it. . . .The attraction is perhaps similar to the attraction of schemes for turning water into gasoline, extracting gold from the sea, curing cancer, or going to the moon. One doesn't attract thoughtlessly given dollars by means of schemes for cutting the cost of soap by 10%. To sell suckers, one uses deceit and offers glamor.
Pierce's 1969 letter marked the end of official research at Bell Labs for nearly a decade. The defense research agency ARPA, however, chose to persevere. In 1971 they sponsored a research initiative to develop a speech recognizer that could handle at least 1,000 words and understand connected speech, i.e., speech without clear pauses between each word. The recognizer could assume a low-background-noise environment, and it did not need to work in real time. By 1976, three contractors had developed six systems. The most successful system, developed by Carnegie Mellon University, was called Harpy. Harpy was slow—a four-second sentence would have taken more than five minutes to process. It also still required speakers to 'train' it by speaking sentences to build up a reference model. Nonetheless, it did recognize a thousand-word vocabulary, and it did support connected speech.
Research continued on several paths, but Harpy was the model for future success. It used hidden Markov models and statistical modeling to extract meaning from speech. In essence, speech was broken up into overlapping small chunks of sound, and probabilistic models inferred the most likely words or parts of words in each chunk, and then the same model was applied
again to the aggregate of the overlapping chunks. The procedure is computationally intensive, but it has proven to be the most successful. Throughout the 1970s and 1980s research continued. By the 1980s, most researchers were using hidden Markov models, which are behind all contemporary speech recognizers. In the latter part of the 1980s and in the 1990s, DARPA (the renamed ARPA) funded several initiatives. The first initiative was similar to the previous challenge: the requirement was still a one-thousand word vocabulary, but this time a rigorous performance standard was devised. This initiative produced systems that lowered the word error rate from ten percent to a few percent. Additional initiatives have focused on improving algorithms and improving computational efficiency. In 2001, Microsoft released a speech recognition system that worked with Office XP. It neatly encapsulated how far the technology had come in fifty years, and what the limitations still were. The system had to be trained to a specific user's voice, using the works of great authors that were provided, Even after training ,the system was fragile enough that a warning was provided, "If you change the room in which you use Microsoft Speech Recognition and your accuracy drops, run the Microphone Wizard again." On the plus side, the system did work in real time, and it did recognize connected speech. Speech Recognition Today
Current voice recognition technologies work on the ability to mathematically analyze the sound waves formed by our voices through resonance and spectrum analysis. Computer systems first record the sound waves spoken into a microphone through a digital to analog converter. The analog or continuous sound wave that we produce when we say a word is sliced up into small time fragments. These fragments are then measured based on their amplitude levels, the level of compression of air released from a person’s mouth. To measure the amplitudes and convert a sound wave to digital format the industry has commonly used the Nyquist-Shannon Theorem.
The Nyquist –Shannon theorem was developed in 1928 to show that a given analog frequency is most accurately recreated by a digital frequency that is twice the original analog frequency. Nyquist proved this was true because an audible frequency must be sampled once for compression and once for rarefaction. For example, a 20 kHz audio signal can be accurately represented as a digital sample at 44.1 kHz.
The most important goal of current speech recognition software is to recognize commands. This increases the functionality of speech software. Software such as Microsost Sync is built into many new vehicles, supposedly allowing users to access all of the car’s electronic accessories, hands-free. This
software is adaptive. It asks the user a series of questions and utilizes the pronunciation of commonly used words to derive speech constants. These constants are then factored into the speech recognition algorithms, allowing the application to provide better recognition in the future. Current tech reviewers have said the technology is much improved from the early 1990’s
but will not be replacing hand controls any time soon.
Second to command recognition is dictation. Today's market sees value in dictation software as discussed below in transcription of medical records, or papers for students, and as a more productive way to get one's thoughts down a written word. In addition many companies see value in dictation for the process of translation, in that users could have their words translated for written letters, or translated so the user could then say the word back to another party in their native language. Products of these types already exist in the market today.
Errors in Interpreting the Spoken Word
As speech recognition programs process your spoken words their success rate is based on their ability to minimize errors. The scale on which they can do
this is called Single Word Error Rate (SWER) and Command Success Rate (CSR). A Single Word Error is simply put, a misunderstanding of one word in a spoken sentence. While SWERs can be found in Command Recognition Programs, they are most commonly found in dictation software. Command Success Rate is defined by an accurate interpretation of the spoken command. All words in a command statement may not be correctly interpreted, but the recognition program is able to use mathematical models to deduce the command the user wants to execute.
Future Trends & Applications
The Medical Industry
For years the medical industry has been touting electronic medical records (EMR). Unfortunately the industry has been slow to adopt EMRs and some companies are betting that the reason is because of data entry. There isn’t enough people to enter the multitude of current patient’s data into electronic format and because of that the paper record prevails. A company called Nuance (also featured in other areas here, and developer of the software called Dragon Dictate) is betting that they can find a market selling their voice recognition software to physicians who would rather speak patients' data than handwrite all medical information into a person’s file.
The Defense industry has researched voice recognition software in an attempt to make complex user intense applications more efficient and friendly. Currently voice recognition is being experimented with cockpit displays in aircraft under the context that the pilot could access needed data faster and easier.
Command Centers are also looking to use voice recognition technology to search and access the vast amounts of database data under their control in a quick and concise manner during situations of crisis. In addition the military
has also jumped onboard with EMR for patient care. The military has voiced
its commitment to utilizing voice recognition software in transmitting data
into patients' records.