Release of Datasets v1.0 (Hugging Face)

Release of 🤗 Datasets v1.0

(11/09/2020 - text copied and pasted from the Hugging Face newsletter) After a summer of hard work, we are releasing 🤗 Datasets v1.0: the first stable version of our datasets and metrics library (known as “nlp” in its beta versions).

This library started as a way to simplify datasets/metrics access for researchers & teachers, and soon became a test bed for efficient and fast data loading & processing.

This v1.0 release brings many new features, including strong speed improvements, efficient indexing capabilities, multi-modality for image and text datasets, and many reproducibility and traceability improvements.

Notable new features:

  • Pickle support
  • Save and load datasets to/from disk
  • Multiprocessing in map and filter
  • Multi-dimensional arrays support for multi-modal datasets
  • Speed up Tokenization
  • Speed up shuffle/shard/select methods by using indices mappings
  • Speed up download and processing
  • Indexed datasets for hybrid models (REALM/RAG/MARGE)

Many new datasets including:

  • IWSLT 2017
  • CommonGen Dataset
  • CLUE Benchmark (11 datasets)
  • The KILT knowledge source and tasks
  • DailyDialog
  • DoQA dataset (ACL 2020)
  • reuters21578
  • HANS
  • MLSUM
  • Guardian authorship
  • web_questions
  • MS MARCO

Full Changelog can be found here.

Install with: pip install datasets

Tutorials, documentation, and further details can be found in the GitHub repository at https://github.com/huggingface/datasets

We would like to give a huge thank you to the amazing community of early contributors and supporters of the “nlp” beta for their help and contributions and in particular to: Stefan Schweter, Thomas Hudson, Jared Nielsen, Jack Morris, Bharat Raghunathan, Richard Wang, Leandro von Werra, Yoav Artzi, Alessandro Suglia, Mohit Bansal, Antonio V Mendoza, Gustavo Aguilar and all the other 54 early contributors!