Reimplementation of ELECTRA with fastai + Hugging Face

Message copied and pasted from the fastai forum.

ELECTRA with fastai + Hugging Face

After months of development and debugging, I have finally trained a model from scratch and replicated the results of the ELECTRA paper.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

ELECTRA-Small 54.6 89.1 83.7 80.3 88.0 79.7 87.7 60.8 78.0
ELECTRA-Small (electra_pytorch) 57.2 87.1 82.1 80.4 88 78.9 87.9 63.1 78.08

:computer: Code: electra_pytorch

  • AFAIK, the closest reimplementation to the original one, taking care of many easily overlooked details (described below).
  • AFAIK, the only one that has validated itself by replicating the results in the paper.
  • Comes with Jupyter notebooks, in which you can explore the code and inspect the processed data.
  • You don’t need to download or preprocess anything yourself; all you need to do is run the training script.

Results :trophy:

Will add more results here.

Advanced details :page_with_curl: (Skip it if you want)

Listed below are details of the original implementation/paper that are easy to overlook and that I have taken care of. I found these details indispensable for replicating the results of the paper.

  • Use the Adam optimizer without bias correction (bias correction is the default for the PyTorch and fastai Adam optimizers)
  • There is a bug in how the original implementation decays learning rates through layers. See _get_layer_lrs
  • Use gradient clipping
  • For the MRPC and STS tasks, it appends a copy of the dataset with sentence1 and sentence2 swapped to the original dataset, and calls it “double_unordered”
  • For pretraining data preprocessing, it concatenates and truncates sentences to fit the max length, and stops concatenating when it reaches the end of a document.
  • For pretraining data preprocessing, it randomly splits the text into sentence A and sentence B, and also randomly changes the max length
  • For finetuning data preprocessing, it follows BERT’s way of truncating the longer of sentence A and B to fit the max length
  • The output layer is initialized with TensorFlow v1’s default initialization, which is Xavier
  • It uses Gumbel softmax to sample generations from the generator
  • It doesn’t mask like BERT; instead, it replaces the token with [MASK] 85% of the time and leaves it unchanged the other 15%
  • It doesn’t warm up and then decay linearly; it does both together, which means the learning rate is already decaying while it warms up. See here
  • It uses a dropout and a linear layer for the GLUE output layer, not what ElectraClassificationHead uses.
  • It doesn’t tie the input and output embeddings of its generator, although tying them is a common practice in many models.
  • It ties not only the word/pos/token type embeddings but also the layer norm in the embedding layer, between the generator and the discriminator.
  • All public ELECTRA checkpoints are actually ++ models. See this issue
  • It downscales the generator by hidden_size, number of attention heads, and intermediate size, but not by number of layers.
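
The item above about warming up and decaying at the same time can be read as multiplying a linear-decay term by a warmup factor, instead of switching from one to the other. Below is a minimal sketch of that reading in plain Python, next to the conventional "warm up, then decay" schedule for comparison. The function and parameter names (lr_combined, lr_sequential, base_lr, and so on) are illustrative assumptions, not names from electra_pytorch or the original implementation.

```python
def lr_combined(step, total_steps, warmup_steps, base_lr):
    """Linear decay to zero, multiplied by a linear warmup factor.

    Assumed reading of "warmup and decay together": the decay term is
    active from step 0, so the rate never actually reaches base_lr.
    """
    decay = base_lr * max(0.0, 1.0 - step / total_steps)
    warmup = min(1.0, step / warmup_steps)
    return decay * warmup


def lr_sequential(step, total_steps, warmup_steps, base_lr):
    """Conventional schedule: warm up to base_lr, then decay linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```

With 100 total steps and 10 warmup steps, the combined schedule peaks at 0.9 × base_lr at step 10, while the sequential one reaches base_lr exactly at that step.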

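To make the first item in the list (Adam without bias correction) concrete, here is a minimal scalar sketch of a single Adam update with the correction as a switch. It is written in plain Python for illustration only; adam_step and its arguments are hypothetical names, the hyperparameter defaults are arbitrary, and this is not how electra_pytorch or fastai implement the optimizer.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update on a scalar parameter; step counts from 1."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    if bias_correction:
        # Standard Adam: rescale both averages to undo their zero initialization.
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
    else:
        # Variant without bias correction: use the raw averages.
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero state with a gradient of 1.0.
p_corrected, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, step=1, bias_correction=True)
p_raw, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, step=1, bias_correction=False)
```

With these default betas, the uncorrected first step is roughly 3.2 times larger than the corrected one, so the two variants behave differently early in training.
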
Need your help :handshake:

Please consider helping us with the problems listed below, or tag someone else you think might help.

  • Haven’t succeeded in replicating the results of the WNLI trick for ELECTRA-Large described in the paper.
  • Haven’t succeeded in applying torch.jit.trace to speed up the model.
  • When I finetune on GLUE (using ), GPU-util is only about 30-40%. I suspect the reason is the small batch and model size (the forward pass only takes 1ms) or slow CPU speed?

More

Updates on this reimplementation and other tools I create will be tweeted on my Twitter, Richard Wang.

Also, my personal research based on ELECTRA is underway; I hope I can share some good results on Twitter then.