A Survey Organizing Contextualized Encoders

Notes on the survey "Which *BERT? A Survey Organizing Contextualized Encoders" (EMNLP 2020)

Abstract

  • Presents a survey of language representation learning that consolidates the empirical lessons from a large body of work
  • Discusses how to choose among recently proposed models/methods based on the needs of a project

Introduction

  • What, besides state-of-the-art, does this newest paper contribute? Which encoder should we use?

Pretraining tasks

  • Pretraining tasks are usually self-supervised and fall into two groups: token prediction and non-token prediction

Token prediction

  • Predicting a token (e.g., the next word, or a masked word) from its surrounding context

  • ELMo, the cloze task ("Cloze-driven pretraining of self-attention networks"), BERT's masked language modeling (see the masking sketch after this list)

  • Autoregressive token prediction: XLNet

  • Permuted language modeling / non-left-to-right prediction orders: Insertion Transformer, REALM

  • Architecturally unifying MLM and permuted language modeling: MPNet, UniLMv2

  • Whole-word masking and named-entity masking: BERT, ERNIE

  • Masking random spans of text: MASS, XLNet

  • Denoising autoencoder framework: BART, "Denoising based Sequence-to-Sequence Pre-training for Text Generation"
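
A minimal sketch of the BERT-style masked-LM corruption referenced in the list above: sample 15% of positions, then replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged. The toy tokens and vocabulary are placeholders; whole-word, entity, or span masking (ERNIE, MASS) mainly changes which positions get selected, not this corruption rule.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style MLM input corruption (simplified, string tokens)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                   # not a prediction target
    return inputs, labels

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(tokens, vocab, seed=0))
```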

Non-token prediction

  • The argument: a language modeling objective alone is not enough; additional signals are needed to tie text to the real world
  • Next sentence prediction (NSP): BERT
  • Predicting whether sentence A is before, after, or unrelated to sentence B: StructBERT, ALBERT (see the pair-construction sketch after this list)
  • Multi-task pretraining: ERNIE 2.0
  • Discourse information: "What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties"
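
To make the sentence-level objectives concrete, here is a toy sketch of constructing before/after/unrelated training pairs. The 3-way labels follow the bullet above, but the exact sampling recipe (ratios, negative sampling) differs across BERT's NSP, ALBERT's sentence-order prediction, and StructBERT, so the details here are illustrative assumptions.

```python
import random

def make_order_example(doc, corpus, rng):
    """Return (sentence_a, sentence_b, label) from a document (list of sentences)."""
    i = rng.randrange(len(doc) - 1)        # pick a position that has a following sentence
    r = rng.random()
    if r < 1 / 3:
        return doc[i], doc[i + 1], "after"       # B actually follows A
    elif r < 2 / 3:
        return doc[i + 1], doc[i], "before"      # swapped order (ALBERT-style negative)
    other = rng.choice([d for d in corpus if d is not doc])  # a different document
    return doc[i], rng.choice(other), "unrelated"

corpus = [["s1.", "s2.", "s3."], ["t1.", "t2."]]
rng = random.Random(0)
print(make_order_example(corpus[0], corpus, rng))
```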

Efficiency

Training

  • Introduces the LAMB optimizer to cut BERT pretraining time: "Large batch optimization for deep learning: Training BERT in 76 minutes" (see the update sketch below)
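
A minimal numpy sketch of one LAMB update for a single parameter tensor: Adam-style moments plus a layer-wise trust ratio ||w|| / ||update||. The hyperparameter values are illustrative defaults, not the paper's 76-minute training configuration.

```python
import numpy as np

def lamb_step(w, grad, m, v, step, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One LAMB update for one parameter tensor (sketch)."""
    # Adam-style first/second moments with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)

    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    return w - lr * trust_ratio * update, m, v

w = np.random.randn(4, 4)
m, v = np.zeros_like(w), np.zeros_like(w)
w, m, v = lamb_step(w, grad=np.random.randn(4, 4), m=m, v=v, step=1)
```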

Inference

  • Sparse-attention contextualized encoders (which can be trained on longer sequences): Reformer, Longformer
  • Cross-layer weight sharing: ALBERT, Universal Transformers
  • Model pruning:
    • Pruning during pretraining: "Blockwise self-attention for long document understanding"
    • Pruning for downstream tasks: DeFormer, "Are sixteen heads really better than one?"
  • Knowledge distillation: MobileBERT, DistilBERT (see the distillation-loss sketch after this list)
  • Knowledge distillation combined with adaptive inference: FastBERT
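
For the distillation bullets above, a minimal sketch of the soft-target term: the student is trained to match the teacher's temperature-softened output distribution via a KL term scaled by T². The temperature and toy logits are made up; real recipes (DistilBERT, MobileBERT, FastBERT) combine this with hard-label and/or intermediate-layer losses.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-softened softmax over the last axis.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # as in Hinton-style knowledge distillation.
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    kl = np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1)
    return float(kl.mean() * T ** 2)

teacher = np.array([[2.0, 0.5, -1.0]])   # toy logits for a 3-class head
student = np.array([[1.5, 0.2, -0.5]])
print(distillation_loss(student, teacher))
```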

Multilingual

  • Pretraining on multilingual text (a language-sampling sketch follows this list)
  • "Multilingual denoising pre-training for neural machine translation" (mBART), "Cross-lingual language model pretraining" (XLM)
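
One practical detail in multilingual pretraining is how often to sample each language. XLM-style training smooths the raw corpus shares as q_i ∝ p_i^α so that low-resource languages are up-weighted. The corpus sizes and the α value below are illustrative.

```python
import numpy as np

def language_sampling_probs(corpus_sizes, alpha=0.5):
    """Exponentially smoothed language sampling probabilities (sketch)."""
    p = np.array(corpus_sizes, dtype=float)
    p = p / p.sum()            # raw corpus-share probabilities
    q = p ** alpha             # smoothing up-weights low-resource languages
    return q / q.sum()

sizes = {"en": 1_000_000, "fr": 100_000, "sw": 1_000}
print(dict(zip(sizes, language_sampling_probs(list(sizes.values())))))
```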