Faster and smaller quantized NLP with Hugging Face and ONNX Runtime
(11/09/2020, text copied and pasted from the Hugging Face newsletter) Looking to serve transformers models but want to stay on CPU? Check out our newest collaboration with ONNX Runtime, led by ML Engineer Morgan Funtowicz.
Transformers models can now run much faster on commodity CPU servers thanks to quantization support. You can now quantize and export Hugging Face transformers models with a single command and get all the performance benefits of ONNX Runtime.
We also released a brand new documentation page to highlight the possibilities offered by ONNX/ONNX Runtime and how you can leverage both projects from the transformers repository.
Source: Accelerate your NLP pipelines using Hugging Face Transformers and ONNX Runtime
The ONNX (Open Neural Network eXchange) and ONNX Runtime (ORT) projects are part of an effort by industry leaders in the AI field to provide a unified, community-driven format for storing neural networks and, by extension, for executing them efficiently across a variety of hardware with dedicated optimizations.
Starting with transformers v2.10.0, we partnered with ONNX Runtime to provide an easy way to export transformers models to the ONNX format. For details, see our joint blog post, Accelerate your NLP pipelines using Hugging Face Transformers and ONNX Runtime.
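As a rough illustration of the single-command export and quantization described above, the sketch below assembles the command line for the `transformers.convert_graph_to_onnx` script. It assumes a transformers release that still ships that module (later releases moved ONNX export elsewhere) and that supports its `--quantize` flag. The helper name `build_export_command`, the model `bert-base-cased`, and the output path are illustrative choices, not taken from the newsletter. The sketch only builds and prints the command; it does not run the export.

```python
import shlex
import sys


def build_export_command(model, output_path, framework="pt", quantize=True):
    """Return the argument list for the transformers ONNX export script.

    Hypothetical helper for illustration: it only builds the command.
    """
    cmd = [
        sys.executable, "-m", "transformers.convert_graph_to_onnx",
        "--framework", framework,  # "pt" for PyTorch, "tf" for TensorFlow
        "--model", model,          # model name on the Hub or a local path
    ]
    if quantize:
        # Ask the script to also produce a quantized version of the model.
        cmd.append("--quantize")
    cmd.append(output_path)        # positional argument: where to write the .onnx file
    return cmd


command = build_export_command("bert-base-cased", "onnx/bert-base-cased.onnx")
print(shlex.join(command))

# To actually run the export (requires transformers, onnx and onnxruntime):
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to inspect the command first and to pass it to `subprocess.run` without shell quoting issues.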