State of the art in NLP and the future

Keynote: The New Era in NLP | SciPy 2019 | Rachel Thomas

In the past year, we have seen a remarkable number of breakthroughs in the field of natural language processing, including huge leaps forward in classifying, generating, and translating text. Rachel Thomas will share a survey of the field, how it is changing, and what you need to know to get involved.


The Future of Natural Language Processing (04/22/2020)

A walk through interesting papers and research directions from late 2019/early 2020, by Thomas Wolf (Science Lead at HuggingFace) (slides):

  • model size and computational efficiency,
  • out-of-domain generalization and model evaluation,
  • fine-tuning and sample efficiency,
  • common sense and inductive biases.

We can summarize this talk in 3 key points:

  • Make models task-independent (see text-to-text Transformer models)
  • Build training datasets with a higher diversity of natural language components
  • Make models smaller and faster in production while matching the performance of bigger ones
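On the third point, distilled models such as DistilBERT are one common way to shrink models for production. Here is a minimal sketch, assuming the Hugging Face transformers pipeline API (the checkpoint name and input sentence are just examples):

```python
# Minimal sketch: a distilled model used for sentiment classification.
# The checkpoint name is an example of a smaller, faster model that keeps
# most of the accuracy of its larger teacher.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Smaller models can be fast and still accurate."))
```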

Evolution of the topics of scientific papers at ACL (NLP)

See this post.

Summary of the Transformer models in the Hugging Face library

Link to source: https://huggingface.co/transformers/summary.html

This is a summary of the models available in the transformers library. It assumes you’re familiar with the original transformer model. For a gentle introduction check the annotated transformer. Here we focus on the high-level differences between the models. You can check them in more detail in their respective documentation. Also check out the pretrained model page to see the checkpoints available for each type of model and all the community models.

Each one of the models in the library falls into one of the following categories:

Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full sentence so that the attention heads can only see what was before in the text, and not what’s after. Although those models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A typical example of such models is GPT.
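As a minimal sketch of this text-generation use case, using the transformers pipeline API (the model name and prompt are just examples):

```python
# Minimal sketch of autoregressive text generation with a GPT-style model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
output = generator("Natural language processing is", max_length=30, num_return_sequences=1)
print(output[0]["generated_text"])
```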

Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is sentence classification or token classification. A typical example of such models is BERT.
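As an illustration of the masked-language-modeling objective behind these models, here is a minimal sketch using the fill-mask pipeline (the model name and sentence are just examples):

```python
# Minimal sketch: an autoencoding model (BERT) reconstructing a corrupted token.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The goal of NLP is to [MASK] human language."):
    print(prediction["token_str"], round(prediction["score"], 3))
```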

Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first introduced.

Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their most natural applications are translation, summarization and question answering. The original transformer model is an example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
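A minimal sketch of such a text-to-text setup, assuming the transformers pipeline API with a small T5 checkpoint (the model name is an example; the task-specific pipeline adds the T5 task prefix for you):

```python
# Minimal sketch: a sequence-to-sequence model (T5) used for translation.
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("The encoder reads the input and the decoder writes the output.")
print(result[0]["translation_text"])
```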

Multimodal models mix text inputs with other kinds (like image) and are more specific to a given task.

Toward 1 Trillion Parameters

Source: Newsletter The Batch (09/16/2020)

An open source library could spawn trillion-parameter neural networks and help small-time developers build big-league models.

What’s new: Microsoft upgraded DeepSpeed, a library that accelerates the PyTorch deep learning framework. The revision makes it possible to train models five times larger than the framework previously allowed, using relatively few processors, the company said.

How it works: Microsoft debuted DeepSpeed in February, when it used the library to help train the 17 billion-parameter language model Turing-NLG. The new version includes four updates:

  • Three techniques enhance parallelism to use processor resources more efficiently: Data parallelism splits data into smaller batches, model parallelism partitions individual layers, and pipeline parallelism groups layers into stages. Batches, layers, and stages are assigned to so-called worker subroutines for training, making it easier to train extremely large models.
  • ZeRO-Offload efficiently juggles resources available from both conventional processors and graphics chips. The key to this subsystem is the ability to store optimizer states and gradients in CPU, rather than GPU, memory (see the configuration sketch after this list). In tests, a single Nvidia V100 was able to train models with 13 billion parameters without running out of memory — an order of magnitude bigger than PyTorch alone.
  • Sparse Attention uses sparse kernels to process input sequences up to an order of magnitude longer than standard attention allows. In tests, the library enabled Bert and Bert Large models to process such sequences between 1.5 and 3 times faster.
  • 1-bit Adam improves upon the existing Adam optimization method by reducing the volume of communications required. Models that used 1-bit Adam trained 3.5 times faster than those trained using Adam.
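As a rough illustration of how ZeRO-Offload is enabled in practice, here is a hedged sketch of wrapping a toy PyTorch model with deepspeed.initialize. The exact configuration keys vary across DeepSpeed versions, and the model and hyperparameters are placeholders, not recommendations:

```python
# Hedged sketch: training setup with DeepSpeed ZeRO-Offload (optimizer states
# kept in CPU memory). Config keys may differ across DeepSpeed versions.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in for a real network

ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # ZeRO-Offload
    },
}

# Returns a wrapped engine that handles partitioning and CPU offloading.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```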

Results: Combining these improvements, DeepSpeed can train a trillion-parameter language model using 800 Nvidia V100 graphics cards, Microsoft said. Without DeepSpeed, the same task would require 4,000 Nvidia A100s, which are up to 2.5 times faster than the V100, crunching for 100 days.

Behind the news: Deep learning is spurring a demand for computing power that threatens to put the technology out of many organizations’ reach.

  • A 2018 OpenAI analysis found the amount of computation needed to train large neural networks doubled every three and a half months.
  • A 2019 study from the University of Massachusetts found that high training costs may keep universities and startups from innovating.
  • Semiconductor manufacturing giant Applied Materials estimated that AI’s thirst for processing power could consume 15 percent of electricity worldwide by 2025.

Why it matters: AI giants like Microsoft, OpenAI, and Google use enormous amounts of processing firepower to push the state of the art. Smaller organizations could benefit from technology that helps them contribute as well. Moreover, the planet could use a break from AI’s voracious appetite for electricity.

We’re thinking: GPT-3 showed that we haven’t hit the limit of model and dataset size as drivers of performance. Innovations like this are important to continue making those drivers more broadly accessible.