Release of Datasets v1.0
(11/09/2020 - text copied and pasted from the Hugging Face newsletter) After a summer of hard work, we are releasing Datasets v1.0: the first stable version of our datasets and metrics library (known as “nlp” in its beta versions).
This library started as a way to simplify datasets/metrics access for researchers & teachers, and soon became a test bed for efficient and fast data loading & processing.
This v1.0 release brings many interesting features including strong speed improvements, efficient indexing capabilities, multi-modality for image and text datasets as well as many reproducibility and traceability improvements.
Notable new features:
- Pickle support
- Save and load datasets to/from disk
- Multiprocessing in map and filter
- Multi-dimensional arrays support for multi-modal datasets
- Speed up tokenization
- Speed up shuffle/shard/select methods by using indices mappings
- Speed up download and processing
- Indexed datasets for hybrid models (REALM/RAG/MARGE)
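To make the "indices mappings" item above more concrete, here is a minimal pure-Python sketch of the general idea: shuffle, shard, and select only build a new list of row indices, while the underlying rows are never copied. The `IndexedView` class and its method bodies are hypothetical and written for illustration only; they are not the library's implementation, which stores its data in a different way.

```python
import random


class IndexedView:
    """Toy view over a list of rows, reordered or subset via an indices mapping."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared underlying data, never copied
        self._indices = list(range(len(rows))) if indices is None else indices

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        # Translate the view position into a position in the underlying rows.
        return self._rows[self._indices[i]]

    def select(self, positions):
        # Keep only the requested positions, in the requested order.
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shuffle(self, seed):
        # Shuffle a copy of the small index list instead of the data itself.
        rng = random.Random(seed)
        new_indices = list(self._indices)
        rng.shuffle(new_indices)
        return IndexedView(self._rows, new_indices)

    def shard(self, num_shards, index):
        # Strided split: shard `index` takes every `num_shards`-th row.
        return IndexedView(self._rows, self._indices[index::num_shards])


rows = ["a", "b", "c", "d", "e", "f"]
view = IndexedView(rows)
shuffled = view.shuffle(seed=0)
first_shard = view.shard(num_shards=2, index=0)
subset = view.select([4, 1])
```

Each operation costs time proportional to the number of indices rather than the size of the rows, which is why this kind of mapping can make reordering large datasets cheap.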
Many new datasets including:
- IWSLT 2017
- CommonGen Dataset
- CLUE Benchmark (11 datasets)
- The KILT knowledge source and tasks
- DoQA dataset (ACL 2020)
- Guardian authorship
- MS MARCO
Full Changelog can be found here.
pip install datasets
Tutorials, documentation, and further details can be found in the GitHub repository at https://github.com/huggingface/datasets
We would like to give a huge thank you to the amazing community of early contributors and supporters of the “nlp” beta for their help and contributions and in particular to: Stefan Schweter, Thomas Hudson, Jared Nielsen, Jack Morris, Bharat Raghunathan, Richard Wang, Leandro von Werra, Yoav Artzi, Alessandro Suglia, Mohit Bansal, Antonio V Mendoza, Gustavo Aguilar and all the other 54 early contributors!