Notes on "Which *BERT? A Survey Organizing Contextualized Encoders" (EMNLP 2020)
Abstract
- Presents a survey of language representation learning that consolidates the lessons learned across a wide range of recent results
- Discusses how to choose among recently proposed models/methods based on the needs of a project
Introduction
- What, besides state-of-the-art, does this newest paper contribute? Which encoder should we use?
Pretraining tasks
- Pretraining tasks are usually self-supervised and fall into two groups: token prediction and non-token prediction
Token prediction
- Predicting the next token: ELMo
- Predicting a masked token (the cloze task; a minimal masking sketch follows this list): Cloze-driven Pretraining of Self-attention Networks, BERT
- Autoregressive token prediction: XLNet
- Permutation language modeling: Insertion Transformer, REALM
- Unifying MLM and permuted language modeling in a single architecture: MPNet, UniLMv2
- Whole-word masking and named-entity masking: BERT, ERNIE
- Masking random spans of text: MASS, XLNet
- Denoising autoencoder framing: BART, Denoising-based Sequence-to-Sequence Pre-training for Text Generation
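A minimal sketch of the BERT-style masking recipe behind the cloze/MLM objectives above, assuming token ids from some tokenizer; `mask_for_mlm`, `mask_token_id`, and `vocab_size` are illustrative placeholders, while the 15% selection rate and 80/10/10 split follow the BERT paper:

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_token_id, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels), with labels set to -100 at positions
    that are not prediction targets (the usual ignore index for cross-entropy)."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # not selected as a prediction target
        labels[i] = tok                   # predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```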
Non-token prediction
- The argument is that a language modeling objective alone may not be enough; additional signals are needed to tie text to the real world
- Next sentence prediction (NSP): BERT
- Predicting whether sentence A is before, after, or unrelated to sentence B (see the pair-construction sketch after this list): StructBERT, ALBERT
- Multi-task pretraining: ERNIE 2.0
- Discourse information: What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
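A hedged sketch of how these sentence-level objectives construct training pairs: BERT's NSP samples a random sentence as the negative case, while ALBERT-style sentence-order prediction swaps two adjacent sentences (StructBERT's three-way before/after/unrelated label is a straightforward extension). Function names and the 0/1 label convention are illustrative:

```python
import random

def make_nsp_pair(doc_sentences, all_sentences, rng):
    """BERT-style NSP: 50% of the time return the true next sentence (label 1),
    otherwise a sentence drawn from the whole corpus (label 0)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        return first, doc_sentences[i + 1], 1   # real continuation
    return first, rng.choice(all_sentences), 0  # random distractor

def make_sentence_order_pair(doc_sentences, rng):
    """ALBERT-style sentence-order prediction: the same two adjacent sentences,
    in original order (label 1) or swapped (label 0)."""
    i = rng.randrange(len(doc_sentences) - 1)
    a, b = doc_sentences[i], doc_sentences[i + 1]
    return (a, b, 1) if rng.random() < 0.5 else (b, a, 0)

rng = random.Random(0)
doc = ["The cat sat.", "It purred.", "Then it slept."]
print(make_nsp_pair(doc, doc, rng))
print(make_sentence_order_pair(doc, rng))
```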
Efficiency
Training
- Introduces the LAMB optimizer to cut BERT pretraining time: Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes
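A rough sketch of the per-tensor LAMB update from the cited paper: Adam-style moment estimates followed by a layerwise trust ratio that rescales the step; written in NumPy, with hyperparameter defaults chosen here only for illustration:

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single parameter tensor w.
    m, v are running first/second moments; t is the step count (>= 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    # Layerwise trust ratio: scale the step by ||w|| / ||update||,
    # which is what lets very large batch sizes keep training stable.
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust * update
    return w, m, v
```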
Inference
- Sparse contextualized encoders (which also allow training on longer sequences): Reformer, Longformer
- Cross-layer weight sharing: ALBERT, Universal Transformers
- Model pruning:
  - Pruning during pretraining: Blockwise Self-Attention for Long Document Understanding
  - Pruning on downstream tasks: DeFormer, Are Sixteen Heads Really Better than One?
- Knowledge distillation (see the distillation-loss sketch after this list): MobileBERT, DistilBERT
- Knowledge distillation combined with adaptive inference: FastBERT
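A minimal sketch of the soft-target loss used in DistilBERT-style distillation: a temperature-scaled KL term toward the teacher's distribution, mixed with the ordinary hard-label cross-entropy; the mixing weight `alpha` and temperature `T` are illustrative defaults, not values taken from any particular paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on gold labels with a KL term that pushes the
    student's temperature-softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # rescale so the soft term keeps a comparable gradient magnitude
    return alpha * hard + (1 - alpha) * soft
```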
Multilinguality
- Pretraining on multilingual text (see the language-sampling sketch below)
- Multilingual Denoising Pre-training for Neural Machine Translation (mBART), Cross-lingual Language Model Pretraining (XLM)
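As one concrete detail of how multilingual pretraining data is typically assembled (used, for example, in XLM-style training), languages are often sampled with exponentially smoothed probabilities so that low-resource languages are not drowned out by high-resource ones; this sketch and its `alpha` default are illustrative rather than taken from a specific paper:

```python
def language_sampling_probs(sentence_counts, alpha=0.5):
    """Turn raw per-language corpus sizes into sampling probabilities
    proportional to (n_i / N) ** alpha; alpha < 1 upweights small languages."""
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# e.g. language_sampling_probs({"en": 1_000_000, "sw": 10_000}) gives Swahili
# a much larger share of batches than its raw ~1% of the data.
```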