Note: here is the topic on Data Augmentation for images.
This topic covers "Data Augmentation" (DA) for text: generating additional training examples for a Deep Learning model, which allows it to be trained further while pushing back the problem of overfitting (a model over-specializing to its training dataset).
Until very recently, DA was applied only to images, using mathematical functions to select part of an image, apply linear transformations to it, change its brightness or contrast, and so on. In this respect, fastai currently has one of the most comprehensive DA libraries for images.
But since September 2019, DA techniques have also been applied, successfully, to text!
You will find links to these text DA techniques below (since this post is a wiki, feel free to add your own links), and use this topic to discuss DA techniques!
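To make the idea concrete before the paper list, here is a minimal sketch of one of the simplest text DA operations, random word swapping. This is a generic illustration of the concept, not a technique from any specific paper below: a small, label-preserving perturbation turns one training sentence into several.

```python
import random

def augment_by_word_swap(sentence, n_swaps=1, seed=0):
    """Return a noised copy of `sentence` by swapping random word pairs.

    A deliberately simple text-DA illustration: small perturbations
    yield extra training examples while (hopefully) keeping the label.
    """
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        # Pick two distinct positions and swap their words.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(augment_by_word_swap("the movie was surprisingly good", n_swaps=1))
```

Real text-DA methods (back-translation, synonym replacement, masked-LM substitution) are more careful about preserving meaning, but the training-time mechanics are the same: feed both the original and the perturbed sentence to the model.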
Recent pretrained transformer-based language models have set state-of-the-art performances on various NLP datasets. However, despite their great progress, they suffer from various structural and syntactic biases. In this work, we investigate the lexical overlap bias, i.e., the model classifies two sentences that have a high lexical overlap as entailing regardless of their underlying meaning. To improve the robustness, we enrich input sentences of the training data with their automatically detected predicate-argument structures. This enhanced representation allows the transformer-based models to learn different attention patterns by focusing on and recognizing the major semantically and syntactically important parts of the sentences. We evaluate our solution for the tasks of natural language inference and grounded commonsense inference using the BERT, RoBERTa, and XLNET models. We evaluate the models' understanding of syntactic variations, antonym relations, and named entities in the presence of lexical overlap. Our results show that the incorporation of predicate-argument structures during fine-tuning considerably improves the robustness, e.g., about 20pp on discriminating different named entities, while it incurs no additional cost at test time and does not require changing the model or the training procedure.
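The core move in the abstract above, enriching an input sentence with its predicate-argument structure, can be sketched as simple text concatenation. Note the marker format (`<PRED>`, `<ARGn>`) is hypothetical, invented here for illustration; the paper does not specify its exact input encoding in this excerpt, and in practice the predicate and arguments would come from an automatic semantic role labeler rather than being passed in by hand.

```python
def enrich_with_predicate_arguments(sentence, predicate, arguments):
    """Append a (hypothetical) predicate-argument annotation to a sentence.

    `predicate` and `arguments` would normally be produced by an
    automatic semantic role labeler; here they are supplied directly.
    """
    spans = [f"<PRED> {predicate}"] + [
        f"<ARG{i}> {arg}" for i, arg in enumerate(arguments)
    ]
    return sentence + " " + " ".join(spans)

enriched = enrich_with_predicate_arguments(
    "The cat chased the mouse.",
    predicate="chased",
    arguments=["The cat", "the mouse"],
)
print(enriched)
```

The enriched string exposes "who did what to whom" alongside the raw text, so attention heads can latch onto semantic roles instead of surface word overlap, without any change to the model or training procedure.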
- December 2019: Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering
- Authors: Casimiro Pio Carrino, Marta R. Costa-jussà, José A. R. Fonollosa
- Organizations: TALP Research Center, Universitat Politècnica de Catalunya, Barcelona
- Date: Submitted on 11 Dec 2019 (v1), last revised 12 Dec 2019 (v2)
- Code on GitHub: The Translate-Align-Retrieve (TAR) method for synthetic QA corpora generation
Recently, multilingual question answering has become a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluated our QA models with the recently proposed MLQA and XQuAD benchmarks for cross-lingual Extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving a new state-of-the-art value of 68.1 F1 points on the Spanish MLQA corpus and 77.6 F1 and 61.8 Exact Match points on the Spanish XQuAD corpus. The resulting synthetically generated SQuAD-es v1.1 corpus, which retains almost 100% of the data in the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish.
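The "Retrieve" step of TAR, recovering the answer span inside the translated context, can be sketched with word alignments. This is a simplified reading of the method under stated assumptions: `alignment` is a list of (source index, target index) pairs such as an automatic word aligner might produce, and the translated answer is taken as the minimal target-token window covering everything aligned to the source answer tokens.

```python
def retrieve_translated_span(alignment, src_start, src_end, tgt_tokens):
    """Retrieve the target-language answer span via word alignments.

    `alignment` maps source token indices to target token indices.
    The translated answer span is the minimal window of target tokens
    covering all positions aligned to the source answer [src_start, src_end].
    """
    tgt_positions = [t for s, t in alignment if src_start <= s <= src_end]
    if not tgt_positions:
        return None  # no alignment found; the example would be discarded
    lo, hi = min(tgt_positions), max(tgt_positions)
    return " ".join(tgt_tokens[lo:hi + 1])

# Source answer: "the capital of France" (source tokens 0-3),
# translated context tokens: "la capital de Francia"
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
span = retrieve_translated_span(alignment, 0, 3, ["la", "capital", "de", "Francia"])
print(span)  # → la capital de Francia
```

Because machine translation can reorder or merge words, real alignments are rarely this clean; the minimal-window heuristic is what lets the synthetic Spanish SQuAD keep answer annotations consistent with its translated contexts.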
- September 2019: Unsupervised Data Augmentation for Consistency Training
- Authors: Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le
- Organizations: Google Research, Brain Team, Carnegie Mellon University
- Date: 30 Sep 2019
- Code on GitHub: UDA - Unsupervised Data Augmentation
Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically that produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 2.7% with only 4,000 examples, nearly matching the performance of models trained on 50,000 labeled examples. Our method also combines well with transfer learning, e.g., when fine-tuning from BERT, and yields improvements in the high-data regime, such as ImageNet, whether there is only 10% labeled data or a full labeled set with 1.3M extra unlabeled examples.
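The consistency-training objective at the heart of UDA can be sketched in a few lines: on unlabeled data, penalize the model when its prediction on an augmented (noised) example drifts away from its prediction on the original example. This is a minimal NumPy sketch of the unsupervised loss term only, not the paper's full training loop (which also includes the supervised loss, stops gradients through the original prediction, and uses techniques like training-signal annealing).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def uda_consistency_loss(probs_original, probs_augmented):
    """Unsupervised consistency term: distance between the model's
    predicted distribution on an unlabeled example and on an
    advanced-augmentation (e.g. back-translated) copy of it."""
    return kl_divergence(probs_original, probs_augmented)

# Identical predictions on original and augmented input -> zero loss;
# disagreement is penalized, pushing predictions to be noise-invariant.
print(uda_consistency_loss([0.7, 0.3], [0.7, 0.3]))  # → 0.0
print(uda_consistency_loss([0.9, 0.1], [0.5, 0.5]) > 0)  # → True
```

The key claim of the paper is that *what* produces `probs_augmented` matters: replacing simple noise with strong, meaning-preserving augmentations (back-translation for text, RandAugment for images) is what drives the reported gains.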
- 2018: Why should we care about linguistics? (Ellie Pavlick)
There are countless examples of how deep learning has shattered previous state-of-the-art results on language processing tasks, including machine translation, question answering, text classification, and parsing. The current optimism surrounding these new techniques has led us to set more ambitious goals: not just to perform text-processing tasks well, but to encode "meaning" in some fundamental, application-independent sense that can prove useful across tasks, architectures, or objective functions.
I argue that while the models we have at our disposal are new, the questions that arise as we attempt to build such task-independent representations are age-old. I will survey a variety of competing models of knowledge representation and inference that have been proposed in the fields of linguistics and cognitive science. I will present some experimental results involving both human subjects and computational NLU systems which illustrate weaknesses in our current models, and highlight why our decisions with respect to these theoretical models matter in practice. I will offer a speculative discussion on why paying better attention to the linguistic and cognitive assumptions we make as we develop new ML architectures can help us make better, faster progress.