Inferring the source of official texts: can SVM beat ULMFiT?

This page holds the dataset and source code described in the paper named above:

cc @peluz, @teodecampos


@peluz and @teodecampos: are you familiar with the "version 2" of ULMFiT called MultiFiT?

Abstract: Pretrained language models are promising particularly for low-resource languages as they only require unlabelled data. However, training existing models requires huge amounts of compute, while pretrained cross-lingual models often underperform on low-resource languages. We propose Multi-lingual language model Fine-Tuning (MultiFiT) to enable practitioners to train and fine-tune language models efficiently in their own language. In addition, we propose a zero-shot method using an existing pretrained cross-lingual model. We evaluate our methods on two widely used cross-lingual classification datasets where they outperform models pretrained on orders of magnitude more data and compute. We release all models and code.
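For anyone wondering what this looks like in practice, below is a minimal sketch of the ULMFiT-style fine-tune-then-classify recipe using fastai v2's high-level text API. It is not the authors' MultiFiT code: MultiFiT follows the same three stages but swaps in QRNNs and SentencePiece subword tokenization, and pretrains the language model in the target language rather than starting from fastai's English AWD-LSTM. The DataFrame `df` with `text` and `label` columns is an assumed placeholder.

```python
# Minimal ULMFiT-style fine-tuning sketch with fastai v2.
# Assumption: a pandas DataFrame `df` with 'text' and 'label' columns.
# MultiFiT uses the same recipe but with QRNNs + subword tokenization
# and a language model pretrained in the target language.
from fastai.text.all import *

# Stage 2: fine-tune the pretrained language model on the target corpus.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM,
                                  metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(3, 1e-2)
learn_lm.save_encoder('ft_enc')  # keep the fine-tuned encoder weights

# Stage 3: train a classifier on top of the fine-tuned encoder.
dls_clas = TextDataLoaders.from_df(
    df, text_col='text', label_col='label',
    text_vocab=dls_lm.vocab,  # reuse the LM vocabulary
)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas = learn_clas.load_encoder('ft_enc')
learn_clas.fine_tune(3, 1e-2)
```

The key design point, shared by ULMFiT and MultiFiT, is that the classifier reuses the language model's vocabulary and encoder, so the labelled data only has to teach the final task head.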

I know it exists, but not much beyond that haha. I haven't read the paper yet, and I don't know the details of the method either.

I can confirm that MultiFiT produces very good results.

Last year I published natural language models based on MultiFiT on my GitHub, including, in particular, a model for Portuguese.