Summary of the models Transformer in the Hugging Face library
Link to source: https://huggingface.co/transformers/summary.html
This is a summary of the models available in the transformers library. It assumes you’re familiar with the original transformer model. For a gentle introduction check the annotated transformer. Here we focus on the high-level differences between the models. You can check them more in detail in their respective documentation. Also checkout the pretrained model page to see the checkpoints available for each type of model and all the community models.
Each one of the models in the library falls into one of the following categories:
Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the previous ones. They correspond to the decoder of the original transformer model, and a mask is used on top of the full sentence so that the attention heads can only see what was before in the next, and not what’s after. Although those models can be fine-tuned and achieve great results on many tasks, the most natural application is text generation. A typical example of such models is GPT.
Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. They correspond to the encoder of the original transformer model in the sense that they get access to the full inputs without any mask. Those models usually build a bidirectional representation of the whole sentence. They can be fine-tuned and achieve great results on many tasks such as text generation, but their most natural application is sentence classification or token classification. A typical example of such models is BERT.
Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given model has been used for both pretraining, we have put it in the category corresponding to the article it was first introduced.
Sequence-to-sequence models use both the encoder and the decoder of the original transformer, either for translation tasks or by transforming other tasks to sequence-to-sequence problems. They can be fine-tuned to many tasks but their most natural applications are translation, summarization and question answering. The original transformer model is an example of such a model (only for translation), T5 is an example that can be fine-tuned on other tasks.
Multimodal models mix text inputs with other kinds (like image) and are more specific to a given task.