Microsoft : Turing-NLRv5 achieves new performance milestones

12/03/2021 | 01:22pm EST

As part of Microsoft AI at Scale, the Turing family of NLP models is being used at scale across Microsoft to enable the next generation of AI experiences. Today, we are happy to announce that the latest Microsoft Turing model, T-NLRv5, is the state of the art at the top of the SuperGLUE and GLUE leaderboards, further surpassing human performance and other models. Notably, T-NLRv5 is the first to achieve human parity on MNLI and RTE, the last two GLUE tasks on which human parity had not yet been reached. In addition, T-NLRv5 is more efficient than recent pretraining models, achieving comparable effectiveness with 50% fewer parameters and less pretraining compute.

The Turing Natural Language Representation (T-NLRv5) integrates some of the best modeling techniques developed by Microsoft Research, Azure AI, and Microsoft Turing. The models are pretrained at large scale using an efficient training framework based on FastPT and DeepSpeed. We're excited to bring new AI improvements to Microsoft products using these state-of-the-art techniques.

Model architecture and pretraining task

T-NLRv5 is largely based on our recent work COCO-LM, a natural evolution of the pretraining paradigm that converges the benefits of ELECTRA-style models and corrective language model pretraining. As illustrated in Figure 2, T-NLRv5 employs an auxiliary transformer language model to corrupt an input text sequence, and the main transformer model is pretrained with the corrective language model task: detecting and correcting the tokens replaced by the auxiliary model. This augments the ELECTRA model family with language modeling capacity, bringing together the benefits of pretraining with adversarial signals generated by the auxiliary model and of language modeling itself, which is handy for prompt-based learning.
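To make the corrective language model task concrete, here is a minimal toy sketch of how the training targets are constructed once the auxiliary model has corrupted a sequence. The function name and the example sentence are illustrative, not Microsoft's actual implementation; in practice the replacements are sampled from the auxiliary language model and everything operates on token IDs.

```python
# Toy sketch of the corrective language model (CLM) pretraining signal.
# Given the original sequence and the auxiliary model's corrupted copy,
# the main model must (1) detect which positions were replaced and
# (2) recover the original token at every position.

def make_clm_targets(original, corrupted):
    detect = [int(o != c) for o, c in zip(original, corrupted)]
    correct = list(original)  # the correction target is always the original token
    return detect, correct

original  = ["the", "cat", "sat", "on", "the", "mat"]
# Pretend the auxiliary LM sampled plausible replacements at two positions:
corrupted = ["the", "dog", "sat", "on", "the", "rug"]

detect, correct = make_clm_targets(original, corrupted)
print(detect)   # [0, 1, 0, 0, 0, 1]
print(correct)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

The detection signal is the ELECTRA-style replaced-token objective; the correction signal is what adds the language modeling capacity the passage above describes.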

We also leverage the training dataset and the data processing pipeline optimized for developing previous T-NLR releases, including DeBERTa and UniLM, as well as the implementation optimizations from other Microsoft pretraining research efforts, such as TUPE.

Another key property of T-NLRv5 is that it maintains its effectiveness across model sizes, from base and large versions with a few hundred million parameters to bigger versions with billions of parameters. This is achieved by a careful selection of techniques that maintain model simplicity and optimization stability. We disabled dropout in the auxiliary model so that pretraining the auxiliary model and generating the main model's training data are done in one pass. We also disabled the sequence contrastive learning task in COCO-LM to reduce compute cost. This enables us to stick with the post-layer-norm transformer architecture, which allows us to train deeper transformer networks more thoroughly.
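The post-layer-norm choice mentioned above refers to where layer normalization sits relative to the residual connection. The numpy sketch below contrasts the two orderings; `sublayer` stands in for attention or the feed-forward network, and all details are illustrative rather than T-NLRv5's actual code.

```python
# Post-layer-norm vs. pre-layer-norm sublayer ordering, sketched in numpy.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original Transformer): normalize AFTER the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual stream stays unnormalized.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
ff = lambda h: np.tanh(h)           # stand-in sublayer
y = post_ln_block(x, ff)
# Post-LN keeps every block's output normalized: per-position mean ~0.
print(np.allclose(y.mean(-1), 0.0, atol=1e-6))   # True
```

Because post-LN renormalizes the residual stream after every block, outputs stay on a consistent scale in deep stacks, which is consistent with the text's point about training deeper networks, at the cost of the optimization-stability care the authors describe.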

Efficiently scaling up language model pretraining

Training billion-parameter neural models can be prohibitively expensive in both time and computing costs. This yields a long experimental cycle that slows down scientific developments and raises cost-benefit concerns. In making T-NLRv5, we leveraged two approaches to improve its scaling efficiency to ensure optimal use of model parameters and pretraining compute.

Customized CUDA kernels for mixed precision. We leverage the customized CUDA kernels developed for Fast PreTraining (FastPT), which are tailored to the transformer architecture and optimized for speed in mixed-precision (FP16) pretraining. This not only improves the efficiency of transformer training and inference by 20%, but also provides better numerical stability in mixed-precision training. The latter is one of the most important needs when pretraining language representation models with billions of parameters.
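A quick illustration of why numerical stability is the hard part of FP16 training: float16 cannot represent magnitudes below roughly 6e-8, so small gradients silently underflow to zero. Loss scaling, the standard remedy used by mixed-precision frameworks generally (the sketch below is not FastPT's mechanism), multiplies the loss by a large constant before the backward pass and unscales the gradients in float32.

```python
# Demonstrating FP16 gradient underflow and the loss-scaling remedy.
import numpy as np

scale = 1024.0
grad = 1e-8                              # smaller than float16's ~6e-8 minimum

naive     = np.float16(grad)             # rounds to 0.0: the gradient is lost
scaled    = np.float16(grad * scale)     # ~1.02e-5, representable in FP16
recovered = np.float32(scaled) / scale   # unscale in FP32

print(naive)       # 0.0
print(float(recovered))                  # close to the true 1e-8
```

Customized kernels can additionally keep accumulations in FP32 internally, which is the kind of stability benefit the passage alludes to.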

ZeRO optimizer. When scaling T-NLRv5 up to billions of parameters, we bring in the ZeRO optimizer technique from DeepSpeed, described in a previous blog post, to reduce the GPU memory footprint of pretraining in multi-machine parallel training. Specifically, the T-NLRv5 XXL (5.4 billion parameters) version uses ZeRO optimizer stage 1 (optimizer state partitioning), which reduces the GPU memory footprint by a factor of five.
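The memory win from stage 1 comes from a simple accounting fact: with mixed-precision Adam, the optimizer states (FP32 master weights, momentum, and variance, 12 bytes per parameter) dwarf the FP16 weights and gradients (4 bytes per parameter), and only the former are sharded. The arithmetic below is a back-of-envelope sketch under those common assumptions; the GPU count and the exact byte layout are illustrative, not the published training configuration.

```python
# Back-of-envelope per-GPU memory for ZeRO stage 1 (optimizer state partitioning).
def gpu_memory_gb(params, n_gpus, zero_stage1=True):
    replicated = params * (2 + 2)       # FP16 weights + FP16 gradients, replicated
    optimizer  = params * 12            # FP32 master weights + Adam momentum/variance
    if zero_stage1:
        optimizer /= n_gpus             # only the optimizer states are sharded
    return (replicated + optimizer) / 1024**3

params = int(5.4e9)                     # T-NLRv5 XXL size
print(round(gpu_memory_gb(params, 64, zero_stage1=False), 1))  # 80.5
print(round(gpu_memory_gb(params, 64, zero_stage1=True), 1))   # 21.1
```

Stages 2 and 3 of ZeRO shard the gradients and weights as well, shrinking the replicated portion further.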

Achieving best effectiveness and efficiency simultaneously

By combining the above modeling techniques and infrastructure improvements, T-NLRv5 provides the best effectiveness and efficiency simultaneously at various trade-off points. To the best of our knowledge, T-NLRv5 achieves state-of-the-art effectiveness at various model sizes and pretraining computing costs.

The model configurations for T-NLRv5 variants are displayed in Table 1. As shown in Figure 4 and Figure 5, when measured on MNLI, one of the most stable tasks on GLUE, T-NLRv5 variants with substantially fewer parameters or compute steps often significantly outperform previous pretraining models with larger pretraining costs. T-NLRv5 Base outperforms RoBERTa Large using 50% of the parameters. With 434 million parameters, T-NLRv5 Large performs on par with DeBERTa XL (1.5 billion parameters) and outperforms the Megatron encoder with 3.9 billion parameters. T-NLRv5 also significantly improves pretraining efficiency: it reaches the accuracy of our latest XL model, T-NLRv4 (1.5 billion parameters), with only 40% of the pretraining steps, using the same training corpora and computing environment.

Robust model adaptation

Robustness is important for a model to perform well on test samples that differ dramatically from its training data. In this work, we use two methods to improve the robustness of adapting T-NLRv5 to downstream tasks. The first is posterior differential regularization (PDR), which regularizes the difference between the model's posteriors on clean and noisy inputs during training. The second is multi-task learning, as in the multi-task deep neural network (MT-DNN), which improves robustness by learning representations across multiple NLU tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations, which adapt better to new tasks and domains.
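The PDR idea above can be sketched in a few lines: add a penalty to the task loss that measures how much the model's output distribution moves when its input is perturbed. The tiny linear "model", the Gaussian input noise, and the choice of symmetric KL as the difference measure are all illustrative stand-ins, not the paper's exact formulation.

```python
# Illustrative sketch of posterior differential regularization (PDR).
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sym_kl(p, q):
    # Symmetric KL divergence: one common measure of posterior difference.
    kl = lambda a, b: sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def model(x, w):                         # toy linear "model" over 3 classes
    return softmax([wi * x for wi in w])

random.seed(0)
w, x = [0.5, -0.2, 1.0], 2.0
clean = model(x, w)
noisy = model(x + random.gauss(0, 0.1), w)   # perturbed copy of the input
reg = sym_kl(clean, noisy)               # added to the task loss during training
print(reg >= 0.0)                        # True: the penalty is non-negative
```

Minimizing this penalty pushes the model toward the same prediction on clean and perturbed inputs, which is exactly the robustness behavior the passage describes.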

With these robust model adaptation techniques, our T-NLRv5 XXL model is the first to reach human parity on MNLI in test accuracy (92.5 versus 92.4), the most informative task on GLUE, while using only a single model and single-task fine-tuning, i.e., without ensembling.

Table 2 presents some examples from MNLI dev-mismatched set where the T-NLRv5 XXL model can predict the correct label, but one of our authors made the wrong prediction. These are quite difficult examples, and we are glad to see T-NLRv5 XXL can accurately complete the task.

T-NLRv5: Release Information

We will make T-NLRv5 and its capabilities available in the same way as other Microsoft Turing models. We will leverage its increased capabilities to further improve the execution of popular language tasks in Azure Cognitive Services; customers will automatically benefit from these improvements.

Customers interested in using Turing models for their own specific task can submit a request to join the Turing Private Preview. Finally, we will make T-NLRv5 available to researchers for collaborative projects via the Microsoft Turing Academic Program.

Learn more:

Explore an interactive demo with AI at Scale models

Learn more about the technology layers that power AI at Scale models

See how DeBERTa, part of Microsoft's Turing family of models, performs against SuperGLUE tasks

Conclusion: Building and democratizing more inclusive AI

The Microsoft Turing model family plays an important role in delivering language-based AI experiences in Microsoft products. T-NLRv5's further surpassing of human performance on the SuperGLUE and GLUE leaderboards reaffirms our commitment to keep pushing the boundaries of NLP and continuously improving these models, so that we can ultimately bring smarter, more responsible AI experiences to our customers.

We welcome your feedback and look forward to sharing more developments in the future.


Microsoft Corporation published this content on 03 December 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 03 December 2021 18:21:08 UTC.

© Publicnow 2021