Message copied and pasted from the fastai forum.
ELECTRA with fastai + Hugging Face
- Code: https://github.com/richarddwang/electra_pytorch…
- Post: https://discuss.huggingface.co/t/electra-training-reimplementation-and-discussion/1004…
After months of development and debugging, I have finally trained a model from scratch and replicated the results of the ELECTRA paper.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
Model | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | Avg. |
---|---|---|---|---|---|---|---|---|---|
ELECTRA-Small | 54.6 | 89.1 | 83.7 | 80.3 | 88.0 | 79.7 | 87.7 | 60.8 | 78.0 |
ELECTRA-Small (electra_pytorch) | 57.2 | 87.1 | 82.1 | 80.4 | 88 | 78.9 | 87.9 | 63.1 | 78.08 |
Code: electra_pytorch
- AFAIK, the closest reimplementation to the original one; it takes care of many easily overlooked details (described below).
- AFAIK, the only one that has successfully validated itself by replicating the results in the paper.
- Comes with Jupyter notebooks, in which you can explore the code and inspect the processed data.
- You don’t need to download or preprocess anything yourself; all you need to do is run the training script.
Results
Will add more results here.
Advanced details (Skip it if you want)
Below are the details of the original implementation/paper that are easy to overlook and that I have taken care of. I found these details indispensable for replicating the results of the paper.
- Use the Adam optimizer without bias correction (bias correction is the default in the PyTorch and fastai Adam optimizers)
- There is a bug in how the original implementation decays learning rates across layers. See _get_layer_lrs
- Use gradient clipping
- For the MRPC and STS tasks, it appends a copy of the dataset with sentence1 and sentence2 swapped to the original dataset, and calls it “double_unordered”
- For pretraining data preprocessing, it concatenates and truncates sentences to fit the max length, and stops concatenating when it reaches the end of a document.
- For pretraining data preprocessing, it randomly splits the text into sentence A and sentence B, and also randomly changes the max length
- For finetuning data preprocessing, it follows BERT’s way of truncating the longer of sentence A and B to fit the max length
- The output layer is initialized with TensorFlow v1’s default initialization, which is Xavier
- It uses Gumbel softmax to sample generations from the generator
- It doesn’t mask like BERT: 85% of the masked positions are replaced with [MASK] and 15% are left unchanged
- It doesn’t warm up and then linearly decay; it does both together, which means the learning rate is already decaying while it warms up. See here
- It uses a dropout and a linear layer for the GLUE output layer, not what `ElectraClassificationHead` uses.
- It doesn’t tie the input and output embeddings of its generator, although tying them is common practice in many models.
- It ties not only the word/pos/token type embeddings but also the layer norm in the embedding layer, for the generator and discriminator.
- All public ELECTRA checkpoints are actually ++ models. See this issue
- It downscales the generator by hidden_size, number of attention heads, and intermediate size, but not by number of layers.
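A few of the items above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not code from electra_pytorch or the original implementation: the function names (`adam_step`, `lr_at`, `double_unordered`, `truncate_pair`) and all default hyperparameters are hypothetical and only for illustration, the schedule assumes that "warms up and decays at the same time" means a linear decay factor multiplied by a linear warmup factor, and special tokens are ignored when truncating.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=False):
    """One Adam update for a single scalar parameter; returns (param, m, v).

    Hyperparameter defaults are illustrative, not the values used for ELECTRA.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    if bias_correction:
        # Standard Adam rescales both moment estimates by 1 - beta**step.
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
    else:
        # Variant without bias correction: use the raw moment estimates.
        m_hat, v_hat = m, v
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def lr_at(step, base_lr, total_steps, warmup_steps):
    """Learning rate with warmup and linear decay applied together.

    Assumption: the decay factor is already active during warmup, so the
    peak learning rate stays below base_lr.
    """
    decay = max(0.0, 1.0 - step / total_steps)
    warmup = min(1.0, step / warmup_steps)
    return base_lr * decay * warmup


def double_unordered(examples):
    """Append a copy of the dataset with the two sentences swapped.

    Each example is assumed to be a (sentence1, sentence2, label) tuple.
    """
    examples = list(examples)
    return examples + [(s2, s1, label) for s1, s2, label in examples]


def truncate_pair(tokens_a, tokens_b, max_len):
    """BERT-style pair truncation: drop tokens from the end of the longer
    sequence, one at a time, until the pair fits max_len (max_len >= 0)."""
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > max_len:
        longer = a if len(a) > len(b) else b
        longer.pop()
    return a, b
```

With `total_steps=100` and `warmup_steps=10`, for example, `lr_at` reaches only 90% of `base_lr` at the end of warmup, because the decay factor has already dropped to 0.9 by then. In a warmup-then-decay schedule the peak would be the full `base_lr`.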
Need your help
Please consider helping us with the problems listed below, or tag someone else you think might help.
- Haven’t succeeded in replicating the results of the WNLI trick for ELECTRA-Large described in the paper.
- Haven’t succeeded in applying `torch.jit.trace` to speed up the model.
- When I finetune on GLUE (using `finetune.py`), GPU-util is only about 30-40%. I suspect the cause is the small batch and model size (the forward pass only takes 1ms) or slow CPU speed?
More info
Updates to this reimplementation and other tools I create will be tweeted from my Twitter account, Richard Wang.
My personal research based on ELECTRA is also underway; I hope to share some good results on Twitter then.