A brief history of speech recognition development, from technology to systems
In its "Voice Technology Report 2019", the well-known US investment firm Mangrove Capital Partners gave voice a grand billing: welcome to the next generation of disruptors. Yet if you turn the clock back 10 years, most people still saw "voice interaction" as a big gamble. Everyone knew the winner's payoff would be large, but few dared to bet for long, because there was no clear timetable for the concept to land, and before the right path is found there is always unsettling uncertainty. Still, over the past 80 years, humanity's hopes for voice technology have never been extinguished; like searching for the exit of a maze, we tried again and again and finally found the right way out.

A long childhood

"What's the weather tomorrow?" "Play a Jay Chou song." Commands like these are issued hundreds of millions of times a day, and even a babbling child can talk smoothly with a smart speaker. Yet 50 years ago, John Pierce of Bell Labs wrote speech recognition a "death certificate" in an open letter: getting a machine to recognize speech, he argued, was about as likely as turning water into gasoline, extracting gold from the ocean, or thoroughly curing cancer. By then, the first machine capable of synthesizing speech had existed for 30 years, and 17 years had passed since the invention of a machine that could recognize the spoken digits 0 through 9. Both creative inventions came from Bell Labs, but the slow progress of speech recognition had nearly exhausted everyone's patience.

For most of the 20th century, speech recognition was like a long march with no destination in sight, its progress measured in decades. In the 1960s, three key techniques — time-normalization mechanisms, dynamic time warping (DTW), and phoneme-based dynamic tracking — laid the foundation for the development of speech recognition. In the 1970s, speech recognition entered a stage of rapid development.
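To make the dynamic time warping idea from the 1960s concrete, here is a minimal sketch; the function name and the 1-D toy features are illustrative choices of mine, not taken from any historical system.

```python
# A minimal sketch of dynamic time warping (DTW): find the cheapest
# monotonic alignment between two feature sequences so that the same
# word spoken faster or slower still matches its template.

def dtw_distance(a, b):
    """Return the cumulative cost of the best alignment between
    feature sequences a and b (lower = more similar)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance for scalar features
            cost[i][j] = d + min(cost[i - 1][j],      # a advances (b repeats)
                                 cost[i][j - 1],      # b advances (a repeats)
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The stretched sequence aligns at zero cost: the extra 2 is absorbed.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Early isolated-word recognizers used exactly this kind of template matching: compare an utterance against one stored template per word and pick the word with the lowest alignment cost.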
Pattern recognition ideas, dynamic programming algorithms, linear predictive coding (LPC), and other techniques began to be applied. In the 1980s, speech recognition moved from isolated-word recognition systems toward large-vocabulary continuous speech recognition, and the GMM-HMM (Gaussian mixture model plus hidden Markov model) framework became the dominant architecture for recognition systems. In the 1990s, many productized speech recognition systems appeared, such as IBM's ViaVoice, Microsoft's Whisper, and the University of Cambridge's HTK toolkit. Yet even into the 21st century, error rates remained very high, and the field fell into another long bottleneck. Not until 2006, when Hinton proposed initializing neural networks with deep belief networks, did deep neural networks become practical to train, setting off the wave of deep learning.

▲ As early as the 1950s, Bell Labs began studying speech recognition; the research at the time centered on simple isolated-word recognition systems.

Across those roughly 70 years, China sat largely at the margins of speech recognition technology. In 1958, the Institute of Acoustics of the Chinese Academy of Sciences used a vacuum-tube circuit to recognize 10 vowels. In 1973, the institute began computer-based speech recognition. Later, the national 863 Program organized research on speech recognition technology, until the rapid rise of Chinese companies such as Baidu and iFLYTEK.

A leaping youth

2010 was destined to be a turning point for speech recognition. The year before, Hinton and Mohamed had applied deep neural networks to acoustic modeling for speech and succeeded on TIMIT, a small-vocabulary continuous speech recognition database.
Starting in 2010, Dong Yu and other researchers at Microsoft made the first attempts to bring deep learning into speech recognition, and competition in the field settled along three dimensions. Data: the amount available depends on search volume and scale of use. Algorithms: top talent plays a decisive role. Computing power: the key lies in the development of FPGAs and other hardware. Along these three dimensions, whoever holds the data advantage, gathers the top talent, and commands strong computing power can win the competition. And so, in its "young" era, speech recognition finally began to develop by leaps and bounds, with the interval between new records shrinking from years to months.
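The records in question are usually quoted as word error rate (WER): the word-level edit distance between the system's transcript and a human reference, divided by the number of reference words. A rough sketch of the computation (function name and toy sentences are my own, for illustration):

```python
# WER = (substitutions + deletions + insertions) / reference length,
# computed via word-level Levenshtein edit distance.

def word_error_rate(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i          # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(r)][len(h)] / len(r)

# One substituted word in a 3-word reference gives a WER of 1/3.
print(round(word_error_rate("the cat sat", "the hat sat"), 3))  # 0.333
```

On this scale, a 5.9% WER means the system gets roughly one word in seventeen wrong, and the "accuracy" figures quoted by vendors are roughly one minus the error rate.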
Speech recognition accuracy reached 90% in 2016, and later that year Microsoft publicly announced that its system's word error rate had fallen to 5.9%, on par with a human transcriber working on the same conversation. Baidu's then chief scientist Andrew Ng countered that Baidu had reached the same level by the end of 2015. In June 2017, Google stated that its speech recognition accuracy had reached 95%; 10 months earlier, Robin Li had announced at the Baidu World conference that Baidu's speech recognition accuracy had reached 97%.

This is a somewhat "strange" phenomenon: why was China, with little accumulation in speech recognition, able to catch up from a standing start in such a short time, and even come from behind to lead? There are two reasons.

First, traditional patent pools were challenged, and competition returned to the technology itself. Speech recognition entered the deep learning era without a heavy patent burden, so players at home and abroad had the chance to stand on the same starting line. Baidu's trajectory illustrates this: in 2013 its speech recognition relied mainly on a Mel filter-bank sub-band CNN model; in 2014 it independently developed sequence-discriminative training; in early 2015 it introduced LSTM-HMM-based recognition, and by the end of that year had built an end-to-end system based on LSTM-CTC; in 2016 and 2017 it combined deep CNN models with LSTM and CTC; in 2018 it launched the Deep Peak 2 model; and in 2019 it introduced a streaming multi-layer truncated attention (SMLTA) model. Baidu has since also introduced the Honghu chip for far-field voice interaction, enabling real-time processing of far-field array signals, high-accuracy ultra-low-false-alarm voice wakeup, and offline speech recognition.

Second, speech recognition has entered an era of ecosystems and industrialization.
After Google released an open speech API, Nuance suffered a near-fatal blow, not only because of Google's advantages in products and technology, but also because of Google's powerful artificial intelligence ecosystem, such as the deep learning engine represented by TensorFlow. By the same logic, Baidu opened up hundreds of intelligent-voice patents in 2015 and formed an intellectual-property industry alliance with Haier, JD.com, ZTE, and Putian. At the same time, the openness and open-sourcing of PaddlePaddle, Warp-CTC, and Baidu Brain have quietly shaped Chinese speech recognition, making Baidu a standard-setter in the field. In addition, Baidu's patents on speech, machine translation, and driverless cars won honors at the 20th China Patent Award announced in 2018, to date the highest-level government award in the domestic patent system for artificial intelligence. The new speech recognition model covered by the speech patent — using deep learning algorithms and high-performance computing to analyze tens of billions of data points in real time within 24 hours — pushed recognition accuracy to 97%, solved key common technical problems in the field, and was named one of MIT Technology Review's "10 Breakthrough Technologies" of 2016.

Speech recognition has thus gradually moved from the labs of universities and research institutes to commercial giants such as Microsoft, Google, and Baidu, finally ushering in a decade of leapfrog development. The "youth" of voice technology may still have a long road ahead, but it has at last walked out of the long night and glimpsed the dawn.

The temptation of voice interaction
Question-and-answer stage: on top of simple Q&A, speech recognition began to take on conversational properties. Representative products include Apple's Siri, Google Now, Baidu Voice, and Microsoft Cortana. At this stage, interaction remained "human-machine dialogue": the machine passively received large amounts of human input, could not understand human meaning at any deeper level, could not learn and grow on its own, and voice communication with machines was nowhere near as natural as between humans.

Natural interaction stage: the step from speech recognition to voice interaction. Beyond questions and answers, the AI can make personalized decisions or suggestions based on contextual logic and environmental information. A typical scenario is the smart speaker: Amazon, Google, Baidu, Xiaomi, Alibaba, and others have all pushed hard into the field. The voice entry point is gradually opening up the ecosystems of content and the Internet of Things, and has become the main battlefield in the fight over artificial intelligence entry points.

It is not hard to see the change. Early speech recognition was still at the technology-building stage, perhaps offering little more than novelty and a cool experience. But as smart speakers, voice assistants, and other software and hardware applications have spread, pain points have been resolved one by one, voice interaction has become a real candidate for the next generation of human-computer interaction, and a new kind of operating system with voice as its entrance is taking shape.

The hands and the tongue are the two most flexible parts of the human body. From the DOS command line to the Xerox graphical interface to the touch interaction of mobile devices, everything so far has depended on the hands.
But now that voice technology and artificial intelligence are maturing together, things may unfold as the "Voice Technology Report 2019" describes: "Voice interaction has upended the previous forms of human-computer interaction, and a new relationship between users and devices has already begun. Just as with the shift from the Internet to the mobile Internet, new demands on the underlying platform are brewing." A voice-first future cannot be ruled out. Amazon's chief scientist Rohit Prasad once put it bluntly: "We want to eliminate friction for customers. The most natural way is through voice. It is not a search engine that merely returns a list of results; it tells you the answer." The implication is that voice technology can free people from the constraints of text and screens and deliver a dimensionally upgraded user experience.

The giants' new battlefield

Google, Baidu, and the other giants that have stepped into their predecessors' shoes are not without "selfish" motives, for even as voice interaction becomes the mainstream of human-computer interaction, it is rewriting existing business rules. In the world of touch interaction, people connect to services through one application or another, and super-apps abound in social networking, search, e-commerce, and local information. Voice interaction, by contrast, is a typical search-for-service model: mainstream profit paths such as search, e-commerce, social networking, and advertising will be restructured, and the existing market order may even be overturned. A typical sign is that Baidu's Xiaodu, Alibaba's Tmall Genie, Xiaomi's Xiao Ai, Google Assistant, and Amazon's Alexa are all no longer content with the status of "voice assistant"; all are expanding toward voice dialogue, content services, and IoT device management, covering scenarios such as homes, cars, and hotels.
An ecosystem built on voice interaction has taken shape, becoming the next killer application after touch.

▲ A typical scenario is the smart speaker: Amazon, Google, Baidu, Xiaomi, Alibaba, and others have all pushed into the field.

At the same time, the disruptive nature of voice is gradually emerging. When you want to listen to a song or watch a movie today, you need to open a specific app on your phone, type in the title, and pick what you want from a list of search results. In a voice interaction scenario, a simple spoken command is enough for the device to play the song or video automatically. This not only multiplies efficiency but also changes the status of music and video services, turning them into back-end content providers. By now, almost every Internet giant has set its sights on voice; in the red-hot smart-speaker race in particular, giants at home and abroad have crowded in.
If 2019 is a new starting point, speech recognition has moved from the era of propeller planes into the era of jets, and the next goal is undoubtedly a rocket-class product. Fortunately, on this battlefield that will decide the future of the technology ecosystem, Chinese players are no longer absent; they have turned from followers into leaders.