The importance of NLP beyond English

For some years now we have been working on problems that involve applying NLP. Until transformers and the models that followed BERT came along, modeling paid little attention to the peculiarities of each language. That has changed drastically.

To take advantage of the wonders that come with the most recent models, such as GPT-3, you obviously need a lot of computational resources, and you also need a sea of data to train language models. Today, what we see most often are models for English. The other 7000+ languages spoken in the world are left by the wayside.

This article by Sebastian Ruder gives a broad view of the problem: https://ruder.io/nlp-beyond-english/. It sheds light on why it matters to bring NLP to other languages. He looks at the implications from the perspectives below, and I quote a few passages from each:

  • Social

What language you speak determines your access to information, education, and even human connections. Even though we think of the Internet as open to everyone, there is a digital language divide between dominant languages (mostly from the Western world) and others. Only a few hundred languages are represented on the web and speakers of minority languages are severely limited in the information available to them.

  • Linguistic

English and the small set of other high-resource languages are in many ways not representative of the world’s other languages. Many resource-rich languages belong to the Indo-European language family, are spoken mostly in the Western world, and are morphologically poor, i.e. information is mostly expressed syntactically, e.g. via a fixed word order and using multiple separate words rather than through variation at the word level.

  • ML

Similarly, neural models often overlook the complexities of morphologically rich languages (Tsarfaty et al., 2020): Subword tokenization performs poorly on languages with reduplication (Vania and Lopez, 2017), byte pair encoding does not align well with morphology (Bostrom and Durrett, 2020), and languages with larger vocabularies are more difficult for language models (Mielke et al., 2019). Differences in grammar, word order, and syntax also cause problems for neural models (Ravfogel et al., 2018; Ahmad et al., 2019; Hu et al., 2020). In addition, we generally assume that pre-trained embeddings readily encode all relevant information, which may not be the case for all languages (Tsarfaty et al., 2020).

  • Cultural and normative

However, such common sense knowledge may be different for different cultures. For instance, the notion of ‘free’ and ‘non-free’ varies cross-culturally where ‘free’ goods are ones that anyone can use without seeking permission, such as salt in a restaurant. Taboo topics are also different in different cultures. Furthermore, cultures vary in their assessment of relative power and social distance, among many other things (Thomas, 1983). In addition, many real-world situations such as ones included in the COPA dataset (Roemmele et al., 2011) do not match the direct experience of many and equally do not reflect key situations that are obvious background knowledge for many people in the world (Ponti et al., 2020).
Consequently, an agent that was only exposed to English data originating mainly in the Western world may be able to have a reasonable conversation with speakers from Western countries, but conversing with someone from a different culture may lead to pragmatic failures.

  • Cognitive

Human children can acquire any natural language and their language understanding ability is remarkably consistent across all kinds of languages. In order to achieve human-level language understanding, our models should be able to show the same level of consistency across languages from different language families and typologies.
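
Going back to the ML passage quoted above, here is a minimal sketch of how byte pair encoding learns its merges purely from pair frequencies. The toy corpus of Portuguese verb forms, the names `learn_bpe`, `merge_pair` and `segment`, and the `</w>` end-of-word marker are my own assumptions for illustration; none of this comes from Ruder's article or from a specific library. Because merges are chosen by frequency alone, nothing ties the resulting pieces to morpheme boundaries such as stem and ending.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency (illustrative values only).
TOY_CORPUS = {
    "falar": 5, "falamos": 3, "falava": 4,
    "cantar": 5, "cantamos": 3, "cantava": 4,
}


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties are broken by comparing the pairs themselves, to keep runs deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(TOY_CORPUS, num_merges=8)
    print(merges)
    # "andamos" is not in the toy corpus; its split depends only on the learned merges.
    print(segment("andamos", merges))
```

Running it prints the learned merges and the split for an unseen verb form, which makes it easy to check by eye whether the pieces line up with stem and ending or not.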

Compared with other languages, Portuguese is actually fairly well served. Our challenge probably has more to do with the computational effort needed to train self-supervised models. And how do we solve the problem for languages for which the data simply does not exist? In his conclusion, Ruder proposes concrete measures for this. It is worth a read.

Finally, I also recommend this post on the state of NLP across several languages:
