Microsoft : Turing-NLRv5 achieves new performance milestones

December 03, 2021 at 01:22 pm EST

As part of Microsoft AI at Scale, the Turing family of NLP models is being used at scale across Microsoft to enable the next generation of AI experiences. Today, we are happy to announce that the latest Microsoft Turing model (T-NLRv5) is the state of the art at the top of the SuperGLUE and GLUE leaderboards, further surpassing human performance and other models. Notably, T-NLRv5 achieved human parity for the first time on MNLI and RTE on the GLUE benchmark, the last two GLUE tasks on which human parity had not yet been reached. In addition, T-NLRv5 is more efficient than recent pretraining models, achieving comparable effectiveness with 50% fewer parameters and pretraining computing costs.

The Turing Natural Language Representation (T-NLRv5) integrates some of the best modeling techniques developed by Microsoft Research, Azure AI, and Microsoft Turing. The models are pretrained at large scale using an efficient training framework based on FastPT and DeepSpeed. We're excited to bring new AI improvements to Microsoft products using these state-of-the-art techniques.

Model architecture and pretraining task

T-NLRv5 is largely based on our recent work, COCO-LM, a natural evolution of the pretraining paradigm that combines the benefits of ELECTRA-style models and corrective language model pretraining. As illustrated in Figure 2, T-NLRv5 employs an auxiliary transformer language model to corrupt an input text sequence, and the main transformer model is pretrained using the corrective language model task, which is to detect and correct tokens replaced by the auxiliary model. This augments the ELECTRA model family with language modeling capacity, combining the benefits of pretraining on adversarial signals generated by the auxiliary model with a language modeling capacity that is handy for prompt-based learning.
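The corrupt-then-correct setup described above can be sketched in a few lines. This is only an illustration of how a training example might be assembled, not the T-NLRv5 implementation: the function name `build_clm_example`, the `mask_rate` default, and the uniform sampling from `vocab` are all assumptions, with the uniform sampler standing in for the auxiliary transformer language model.

```python
import random


def build_clm_example(tokens, vocab, mask_rate=0.15, seed=0):
    """Corrupt `tokens` and return (corrupted, replaced_flags, targets).

    Hypothetical helper: uniform sampling from `vocab` is a stand-in for
    sampling replacements from the auxiliary language model.
    """
    rng = random.Random(seed)
    corrupted, replaced = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            sampled = rng.choice(vocab)
            corrupted.append(sampled)
            # If the sampler happens to return the original token,
            # the position counts as "not replaced".
            replaced.append(sampled != tok)
        else:
            corrupted.append(tok)
            replaced.append(False)
    # Detection labels: `replaced`. Correction targets: the original tokens.
    targets = list(tokens)
    return corrupted, replaced, targets


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    toy_vocab = ["dog", "ran", "under", "a", "rug"]
    noisy, flags, gold = build_clm_example(sentence, toy_vocab, mask_rate=0.5)
    for n, f, g in zip(noisy, flags, gold):
        print(f"{n:>6}  replaced={f!s:<5}  target={g}")
```

The main model would then be trained on `corrupted` as input, with `replaced` as the detection signal and `targets` as the correction signal.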

We also leverage the training dataset and the data processing pipeline optimized for developing previous T-NLR releases, including DeBERTa and UniLM, as well as the implementation optimizations from other Microsoft pretraining research efforts, such as TUPE.

Another key property of T-NLRv5 is that it maintains its effectiveness across model sizes, from smaller sizes (e.g., base and large, with a few hundred million parameters) to bigger sizes with billions of parameters. This is achieved by carefully selecting techniques that maintain model simplicity and optimization stability. We disabled dropout in the auxiliary model so that the pretraining of the auxiliary model and the generation of the main model's training data are done in one pass. We also disabled the sequential contrastive learning task in COCO-LM to reduce computing cost. This enables us to stick to the post-layer norm transformer architecture, which allows us to train deeper transformer networks more thoroughly.

Efficiently scaling up language model pretraining

Training billion-parameter neural models can be prohibitively expensive in both time and computing costs. This yields a long experimental cycle that slows down scientific developments and raises cost-benefit concerns. In making T-NLRv5, we leveraged two approaches to improve its scaling efficiency to ensure optimal use of model parameters and pretraining compute.

Customized CUDA kernels for mixed precision. We leverage the customized CUDA kernels developed for Fast PreTraining (FastPT), which are tailored to the transformer architecture and optimized for speed in mixed-precision (FP16) pretraining. This not only improves the efficiency of transformer training and inference by 20%, but also provides better numerical stability in mixed-precision training. The latter is one of the most important needs when pretraining language representation models with billions of parameters.
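To see why numerical stability matters in FP16, consider how easily a small value underflows. The sketch below uses generic loss scaling, a common mixed-precision technique, purely as an illustration. It is not a description of what the FastPT kernels do, and the helper `to_fp16` and the scale factor of 1024 are assumptions chosen for the example.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back (hypothetical helper)."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8  # a gradient value too small for FP16 to represent
loss_scale = 1024.0  # illustrative scale factor

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling before the FP16 round trip keeps it representable;
# dividing afterwards (in higher precision) recovers its magnitude.
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

if __name__ == "__main__":
    print("stored in FP16 directly:", unscaled)
    print("stored with loss scaling:", recovered)
```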

ZeRO optimizer. When scaling up T-NLRv5 to billions of parameters, we bring in the ZeRO optimizer technique of DeepSpeed, described in a previous blog post, to reduce the GPU memory footprint of pretraining models in multi-machine parallel pretraining processes. Specifically, the T-NLRv5 XXL version (5.4 billion parameters) uses ZeRO optimizer stage 1 (optimizer state partitioning), which reduces the GPU memory footprint by five times.
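As a rough guide to where the savings from optimizer state partitioning come from, the sketch below applies the model-state accounting commonly used for mixed-precision Adam in the ZeRO work: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer states, with only the last term divided across GPUs in stage 1. The function name and the GPU count of 64 are assumptions. The estimate ignores activations and temporary buffers, so it is not expected to reproduce the figure reported above.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, zero_stage1=True):
    """Estimate per-GPU bytes for model states under mixed-precision Adam.

    Hypothetical helper covering weights, gradients, and optimizer states only.
    """
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    # FP32 master weights, momentum, and variance: 4 + 4 + 4 bytes per parameter.
    optimizer_states = 12 * num_params
    if zero_stage1:
        # Stage 1 shards only the optimizer states across data-parallel GPUs.
        optimizer_states = optimizer_states / num_gpus
    return fp16_params + fp16_grads + optimizer_states


if __name__ == "__main__":
    params = 5.4e9  # T-NLRv5 XXL parameter count from the text
    gpus = 64       # assumed data-parallel degree, for illustration only
    baseline = model_state_bytes_per_gpu(params, gpus, zero_stage1=False)
    stage1 = model_state_bytes_per_gpu(params, gpus, zero_stage1=True)
    print(f"baseline: {baseline / 1e9:.1f} GB per GPU")
    print(f"stage 1:  {stage1 / 1e9:.1f} GB per GPU")
    print(f"reduction: {baseline / stage1:.2f}x")
```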

Achieving best effectiveness and efficiency simultaneously

By combining the above modeling techniques and infrastructure improvements, T-NLRv5 provides the best effectiveness and efficiency simultaneously at various trade-off points. To the best of our knowledge, T-NLRv5 achieves state-of-the-art effectiveness at various model sizes and pretraining computing costs.

The model configurations for T-NLRv5 variants are displayed in Table 1. As shown in Figure 4 and Figure 5, when measured on MNLI, one of the most stable tasks on GLUE, T-NLRv5 variants with substantially fewer parameters or computing steps often significantly outperform previous pretraining models with larger pretraining costs. T-NLRv5's base version outperforms RoBERTa Large using 50% of the parameters. Using 434 million parameters, T-NLRv5 Large performs on par with DeBERTa XL (1.5 billion parameters) and outperforms the Megatron encoder with 3.9 billion parameters. T-NLRv5 also significantly improves pretraining efficiency: it reaches the accuracy of our latest XL model, T-NLRv4-1.5B, with only 40% of the pretraining steps, using the same training corpora and computing environments.

Robust model adaptation

Robustness is important for a model to perform well on test samples that are dramatically different from the training data. In this work, we use two methods to improve the robustness of adapting T-NLRv5 to downstream tasks. The first method enhances model robustness through PDR (posterior differential regularization), which regularizes the difference between the model's posteriors on clean and noisy inputs during model training. The second method is multi-task learning, as in the multi-task deep neural network (MT-DNN), which improves model robustness by learning representations across multiple NLU tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations that adapt to new tasks and domains.
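The idea behind PDR can be illustrated with a toy loss: the usual task loss on the clean input, plus a penalty on how far the model's output distribution moves when the input is perturbed. This is a minimal sketch under stated assumptions, not the training code used for T-NLRv5: the divergence (symmetric KL), the weight `alpha`, and all function names are chosen here for illustration, and the logits are supplied directly instead of coming from a model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pdr_loss(clean_logits, noisy_logits, gold, alpha=1.0):
    """Cross-entropy on the clean input plus a posterior-difference penalty.

    Hypothetical helper: symmetric KL is one possible choice of divergence.
    """
    p_clean = softmax(clean_logits)
    p_noisy = softmax(noisy_logits)
    task_loss = -math.log(p_clean[gold])
    posterior_diff = kl(p_clean, p_noisy) + kl(p_noisy, p_clean)
    return task_loss + alpha * posterior_diff


if __name__ == "__main__":
    clean = [2.0, 0.0, -1.0]   # logits for the clean input
    noisy = [1.2, 0.5, -0.8]   # logits for a perturbed copy of the input
    print("loss without perturbation:", pdr_loss(clean, clean, gold=0))
    print("loss with perturbation:   ", pdr_loss(clean, noisy, gold=0))
```

When the perturbed input yields the same posterior as the clean one, the penalty vanishes and only the task loss remains.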

With these robust model adaptation techniques, our T-NLRv5 XXL model is the first to reach human parity in test accuracy (92.5 versus 92.4) on MNLI, the most informative task on GLUE, while using only a single model and single-task fine-tuning, i.e., without ensembling.

Table 2 presents some examples from the MNLI dev-mismatched set where the T-NLRv5 XXL model predicts the correct label, but one of our authors made the wrong prediction. These are quite difficult examples, and we are glad to see that T-NLRv5 XXL can accurately complete the task.

T-NLRv5: Release Information

We will make T-NLRv5 and its capabilities available in the same way as with other Microsoft Turing models.
We will leverage its increased capabilities to further improve the execution of popular language tasks in Azure Cognitive Services. Customers will automatically benefit from these.

Customers interested in using Turing models for their own specific task can submit a request to join the Turing Private Preview. Finally, we will make T-NLRv5 available to researchers for collaborative projects via the Microsoft Turing Academic Program.

Learn more:

Explore an interactive demo with AI at Scale models

Learn more about the technology layers that power AI at Scale models

See how DeBERTa, part of Microsoft's Turing family of models, performs against SuperGLUE tasks

Conclusion: Building and democratizing more inclusive AI

The Microsoft Turing model family plays an important role in delivering language-based AI experiences in Microsoft products. T-NLRv5 further surpassing human performance on SuperGLUE and GLUE leaderboards reaffirms our commitment to keep pushing the boundaries of NLP and continuously improving these models so that we can ultimately bring smarter, more responsible AI product experiences to our customers.

We welcome your feedback and look forward to sharing more developments in the future.


Disclaimer

Microsoft Corporation published this content on 03 December 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 03 December 2021 18:21:08 UTC.
