Nuance Communications : Reducing the human labeling effort for training end-to-end speech recognition

October 23, 2020 at 01:20 pm EDT

The latest generation of Nuance's deep learning technology for speech recognition features a novel algorithm that integrates data augmentation with semi-supervised learning, achieving state-of-the-art recognition accuracy with much less human-labeled data.

Felix Weninger

Posted October 23, 2020

Deep learning technology has rapidly transformed the way that computers perform speech recognition. It has enabled us to build speech recognizers for very challenging applications such as Dragon Ambient eXperience (DAX), which transcribes conversations between doctor and patient. In particular, end-to-end (E2E) speech recognition has been a primary focus of research in recent years. Traditionally, automatic speech recognition (ASR) systems consisted of separate components: an acoustic model for the acoustic patterns of the smallest spoken units of language (phonemes), a pronunciation model for the mapping between phonemes and words, and a language model for the dependencies between words in a sentence.

In contrast, E2E ASR systems subsume the acoustic, pronunciation, and language models into a single deep neural network (DNN). While such E2E models have been shown to be superior in terms of simplicity and accuracy, they require a large amount of labeled speech to learn the hundreds of millions of parameters necessary to achieve state-of-the-art performance. However, manually transcribing large amounts of speech data is a tedious and costly process.

In the paper titled 'Semi-supervised learning with data augmentation for end-to-end ASR', we explored semi-supervised learning (SSL) and data augmentation (DA) as ways to leverage unlabeled speech data, reducing the amount of labeled speech required while maintaining recognition accuracy. Starting from a state-of-the-art E2E ASR system for transcribing doctor-patient conversations, trained on 1900 hours of manually labeled speech data, we show how to combine SSL with DA to achieve similar accuracy with only ¼ of the labeled training data. Our paper has been accepted at INTERSPEECH 2020, the world's largest conference on spoken language processing.

SSL is a family of machine learning techniques for training with a small amount of labeled and a large amount of unlabeled data. In ASR, the most common form of SSL is to use a seed ASR system trained only on the labeled data to generate the (pseudo) transcriptions of the unlabeled data. Then, the labeled data (with manual transcription) and the unlabeled data (with automatically generated transcription) are used jointly to train a new ASR system which should perform better than the seed ASR system. Such a process can be iterated with more unlabeled data being included at each iteration.
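The pseudo-labeling step described above can be sketched in a few lines. This is a minimal illustration, not the implementation from the paper: `seed_transcribe` is a hypothetical stand-in for the seed ASR system, and utterances are represented by plain string identifiers rather than audio.

```python
from typing import Callable, List, Sequence, Tuple

# (utterance, transcription); an utterance is just an identifier in this sketch.
Example = Tuple[str, str]


def pseudo_label(
    unlabeled: Sequence[str],
    seed_transcribe: Callable[[str], str],
) -> List[Example]:
    """Let the seed system transcribe every unlabeled utterance."""
    return [(utt, seed_transcribe(utt)) for utt in unlabeled]


def build_training_set(
    labeled: Sequence[Example],
    unlabeled: Sequence[str],
    seed_transcribe: Callable[[str], str],
) -> List[Example]:
    """Join manual transcriptions with automatically generated ones."""
    return list(labeled) + pseudo_label(unlabeled, seed_transcribe)


if __name__ == "__main__":
    # Toy seed "recognizer" (assumption): returns a fixed string per utterance.
    toy_seed = lambda utt: "pseudo transcript for " + utt
    data = build_training_set([("utt1", "hello doctor")], ["utt2", "utt3"], toy_seed)
    for utt, text in data:
        print(utt, "->", text)
```

In an iterated setup, the system trained on this joint set would take over the role of `seed_transcribe` in the next round.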

DA refers to techniques that create copies of the training data by performing small perturbations (e.g. to the frequency spectrum) of the speech signal, while leaving the labels unchanged. In this way, an arbitrary amount of labeled training data can be generated. In our paper, we employ the SpecAugment technique for doing DA, which is depicted below.

Figure 1: The SpecAugment approach modifies the spectrogram of a speech utterance by randomly masking some regions on the time (horizontal) and frequency (vertical) axes.
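To make the masking idea concrete, here is a small sketch in plain Python. It covers only the time and frequency masking described in the caption; the function name, the parameter names, and the choice of one mask per axis are assumptions made for illustration, not the settings used in the paper.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # rows = frequency bins, columns = time frames


def mask_spectrogram(
    spec: Spectrogram,
    max_freq_width: int,
    max_time_width: int,
    rng: random.Random,
    mask_value: float = 0.0,
) -> Spectrogram:
    """Return a copy of `spec` with one random frequency band and one
    random time span overwritten by `mask_value`. Assumes a non-empty,
    rectangular spectrogram."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: a band of consecutive rows.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for i in range(f_start, f_start + f_width):
        for j in range(n_time):
            out[i][j] = mask_value

    # Time mask: a span of consecutive columns.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for i in range(n_freq):
        for j in range(t_start, t_start + t_width):
            out[i][j] = mask_value

    return out


if __name__ == "__main__":
    toy = [[1.0] * 12 for _ in range(6)]
    for row in mask_spectrogram(toy, 2, 4, random.Random(0)):
        print("".join("#" if v else "." for v in row))
```

Because the mask positions are random, calling the function repeatedly on the same utterance yields different perturbed copies that all share the same label.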

The most obvious approach to using SSL in combination with DA is to have the seed ASR system generate transcriptions for all the unlabeled data, and then apply DA. However, this simple approach has a drawback wherein erroneous transcriptions by the seed ASR system can be reinforced when using these pseudo transcriptions as training targets for the unlabeled data. In our paper, we proposed several techniques to help avoid this kind of error reinforcement.

The first is the consistency training principle: The seed ASR system is asked to transcribe the unlabeled data after it has been passed through DA, thus generating several copies of the data where both the speech and the transcriptions are slightly modified. In contrast to the simple SSL + DA approach, this avoids training on the same, potentially erroneous, transcription many times. The second technique is the usage of so-called soft labels, where the seed ASR system produces a probability distribution over all possible outputs, rather than being forced to make a hard decision. Finally, we found that E2E ASR systems are prone to repeating parts of sentences, which is why we introduced a heuristic technique to filter out transcriptions where the seed ASR system ran into a 'loop'. The proposed SSL + DA algorithm is sketched in the figure below.

Figure 2: Traditional (a) and proposed (b) approach for combining SSL with DA. The proposed approach differs from the traditional one in the usage of consistency training (doing DA in the label generation process) and soft labels.
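The three ingredients (augment first, keep soft labels, drop looping transcriptions) can be combined roughly as follows. This is a sketch under stated assumptions: `augment` and `seed_posteriors` are hypothetical callables standing in for DA and the seed ASR system, a soft label is modeled as one token distribution per output step, and the loop check (consecutive repeats of the same word n-gram) is an illustrative heuristic, since the exact filter from the paper is not described here.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

SoftLabel = List[Dict[str, float]]  # one token distribution per output step


def looks_like_loop(tokens: Sequence[str], ngram: int = 2, min_repeats: int = 3) -> bool:
    """True if some n-gram occurs at least `min_repeats` times in a row."""
    tokens = list(tokens)
    for start in range(len(tokens) - ngram + 1):
        unit = tokens[start:start + ngram]
        repeats, pos = 1, start + ngram
        while tokens[pos:pos + ngram] == unit:
            repeats += 1
            pos += ngram
        if repeats >= min_repeats:
            return True
    return False


def make_consistency_targets(
    utterance: Any,
    augment: Callable[[Any], Any],
    seed_posteriors: Callable[[Any], SoftLabel],
    n_copies: int,
) -> List[Tuple[Any, SoftLabel]]:
    """Label each augmented copy separately instead of reusing one transcript."""
    targets = []
    for _ in range(n_copies):
        noisy = augment(utterance)        # DA happens before labeling
        soft = seed_posteriors(noisy)     # distributions, not hard decisions
        best = [max(dist, key=dist.get) for dist in soft]
        if looks_like_loop(best):         # discard copies where the seed looped
            continue
        targets.append((noisy, soft))
    return targets
```

Each surviving pair can then serve as a training example, with the soft label used as the target distribution instead of a single hard transcription.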

We applied this generic approach to two SSL algorithms known from the image classification literature, the Noisy Student and the FixMatch algorithm, and adapted them to the E2E ASR use case. In the Noisy Student algorithm, the seed ASR system is used as a teacher, while the model to be trained is treated as a student. By modifying the inputs to the student model via DA, the learning task becomes more difficult, requiring the student to generalize the teacher's knowledge to multiple variants of the data. The model trained this way can be expected to be more robust and perform better on unseen data.

In contrast, the FixMatch algorithm does not distinguish between teacher and student. It is a realization of the 'self-training' principle, where a model is trained with its own predictions for the unlabeled data. Since the model's predictions are often wrong in the early stages of training, it is necessary to have a way to measure the correctness of the predictions in the absence of the ground truth. This can be achieved by computing the model's confidence in its output. As shown in the paper, when we only accept predictions with a high confidence, the convergence of the training process is accelerated. Moreover, since the model becomes more and more confident in its predictions over time, the training process iteratively includes more and more unlabeled data.
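A confidence gate of this kind might look like the sketch below. How the confidence is actually computed is not specified in this post; the length-normalized score used here (the geometric mean of per-token probabilities) and the function names are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple

Hypothesis = Tuple[str, Sequence[float]]  # (transcript, per-token probabilities)


def sequence_confidence(token_probs: Sequence[float]) -> float:
    """Geometric mean of token probabilities; 0.0 for empty or zero-probability input."""
    if not token_probs or any(p <= 0.0 for p in token_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def select_confident(hyps: Sequence[Hypothesis], threshold: float) -> List[str]:
    """Keep only the model's own predictions that clear the confidence threshold."""
    return [text for text, probs in hyps if sequence_confidence(probs) >= threshold]


if __name__ == "__main__":
    batch = [
        ("patient reports mild pain", [0.95, 0.9, 0.92, 0.97]),
        ("patient reports wild rain", [0.95, 0.9, 0.2, 0.3]),
    ]
    print(select_confident(batch, threshold=0.8))
```

As the model improves, more hypotheses clear the same threshold, which is one way to picture how the amount of unlabeled data used in training grows over time.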

By applying the consistency training principle along with soft labels and heuristic loop filtering, we were able to outperform the simple approach of doing DA after SSL, achieving a 4% relative improvement in word error rate (WER) on doctor-patient conversations. Furthermore, we found that both the Noisy Student and the FixMatch algorithms converged to similar WERs.
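For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below shows the standard computation; whitespace tokenization and the function name are simplifying assumptions, and production scoring tools usually normalize text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the patient has a fever", "the patient has fever"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions.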

When training with only 475 hours of labeled data, we achieved 16.8% WER. By adding 1425 hours of unlabeled data using the Noisy Student approach, we could reach 14.4% WER. This is in the same ballpark as our best system trained on 1900 hours of labeled data (13.8% WER), while using only ¼ of the labeled training data.
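The numbers above can be put in relation with a little arithmetic. The snippet below only re-derives ratios from the figures quoted in this post; the derived percentages themselves are not reported in the post, and the variable names are illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# Figures quoted in the text.
LABELED_HOURS = 475
UNLABELED_HOURS = 1425
FULL_LABELED_HOURS = 1900
WER_LABELED_ONLY = 16.8   # 475 h labeled
WER_NOISY_STUDENT = 14.4  # 475 h labeled + 1425 h unlabeled
WER_FULLY_LABELED = 13.8  # 1900 h labeled

# Share of the labeled data that is still needed.
labeled_fraction = LABELED_HOURS / FULL_LABELED_HOURS

# Relative WER reduction from adding the unlabeled data.
ssl_gain = relative_reduction(WER_LABELED_ONLY, WER_NOISY_STUDENT)

# Share of the gap to the fully supervised system that SSL closes.
gap_closed = (WER_LABELED_ONLY - WER_NOISY_STUDENT) / (WER_LABELED_ONLY - WER_FULLY_LABELED)

if __name__ == "__main__":
    print(f"labeled fraction: {labeled_fraction:.2f}")
    print(f"relative WER reduction from SSL: {ssl_gain:.1%}")
    print(f"gap to fully labeled system closed: {gap_closed:.0%}")
```

Under these figures, the unlabeled data lowers WER by roughly 14% relative and closes about 80% of the gap between the 475-hour and the 1900-hour labeled systems.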

In conclusion, our results put forward a promising avenue towards building state-of-the-art ASR systems with limited labeled data, which will be highly useful for specialized application domains and under-resourced languages in the future.

Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, and Puming Zhan contributed to the paper and this blog post. The paper will be presented in October at the INTERSPEECH 2020 conference.

Disclaimer

Nuance Communications Inc. published this content on 23 October 2020 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 23 October 2020 17:19:01 UTC
