Notes on "Which *BERT? A Survey Organizing Contextualized Encoders" (EMNLP 2020)
Abstract
- Presents a survey of language representation learning that consolidates the lessons learned across a wide range of recent results
- Discusses how to choose among recently proposed models/methods based on the needs of a project
Introduction
- What, besides state-of-the-art, does this newest paper contribute? Which encoder should we use?
Pretraining tasks
- Pretraining tasks are usually self-supervised and fall into two groups: token prediction and non-token prediction
Token prediction
- Predicting the next token: ELMo
- Predicting a masked token (the cloze task; a minimal masking sketch follows this list): Cloze-driven Pretraining of Self-attention Networks, BERT
- Autoregressive token prediction: XLNet
- Permutation language modeling: Insertion Transformer, REALM
- Unifying MLM and permuted language modeling in a single architecture: MPNet, UniLMv2
- Whole-word masking and named-entity masking: BERT, ERNIE
- Masking random spans of text: MASS, XLNet
- Denoising autoencoder framing: BART, Denoising-based Sequence-to-Sequence Pre-training for Text Generation
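A minimal sketch of the BERT-style masking recipe behind the cloze/MLM objectives above, assuming token ids from some tokenizer; `mask_for_mlm`, `mask_token_id`, and `vocab_size` are illustrative placeholders, while the 15% selection rate and 80/10/10 split follow the BERT paper:

```python
import random

def mask_for_mlm(token_ids, vocab_size, mask_token_id, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels), with labels set to -100 at positions
    that are not prediction targets (the usual ignore index for cross-entropy)."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                      # not selected as a prediction target
        labels[i] = tok                   # predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```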
Non-token prediction
- The argument is that a language modeling objective alone may not be enough; additional signals are needed to tie text to the real world
- Next sentence prediction (NSP): BERT
- Predicting whether sentence A is before, after, or unrelated to sentence B (see the pair-construction sketch after this list): StructBERT, ALBERT
- Multi-task pretraining: ERNIE 2.0
- Discourse information: What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
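A hedged sketch of how these sentence-level objectives construct training pairs: BERT's NSP samples a random sentence as the negative case, while ALBERT-style sentence-order prediction swaps two adjacent sentences (StructBERT's three-way before/after/unrelated label is a straightforward extension). Function names and the 0/1 label convention are illustrative:

```python
import random

def make_nsp_pair(doc_sentences, all_sentences, rng):
    """BERT-style NSP: 50% of the time return the true next sentence (label 1),
    otherwise a sentence drawn from the whole corpus (label 0)."""
    i = rng.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if rng.random() < 0.5:
        return first, doc_sentences[i + 1], 1   # real continuation
    return first, rng.choice(all_sentences), 0  # random distractor

def make_sentence_order_pair(doc_sentences, rng):
    """ALBERT-style sentence-order prediction: the same two adjacent sentences,
    in original order (label 1) or swapped (label 0)."""
    i = rng.randrange(len(doc_sentences) - 1)
    a, b = doc_sentences[i], doc_sentences[i + 1]
    return (a, b, 1) if rng.random() < 0.5 else (b, a, 0)

rng = random.Random(0)
doc = ["The cat sat.", "It purred.", "Then it slept."]
print(make_nsp_pair(doc, doc, rng))
print(make_sentence_order_pair(doc, rng))
```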
Efficiency
Training
- Introduces the LAMB optimizer to cut BERT pretraining time: Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes
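A rough sketch of the per-tensor LAMB update from the cited paper: Adam-style moment estimates followed by a layerwise trust ratio that rescales the step; written in NumPy, with hyperparameter defaults chosen here only for illustration:

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB update for a single parameter tensor w.
    m, v are running first/second moments; t is the step count (>= 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    # Layerwise trust ratio: scale the step by ||w|| / ||update||,
    # which is what lets very large batch sizes keep training stable.
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust * update
    return w, m, v
```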
Inference
- Sparse contextualized encoders (which also allow training on longer sequences): Reformer, Longformer
- Cross-layer weight sharing: ALBERT, Universal Transformers
- Model pruning:
  - Pruning during pretraining: Blockwise Self-Attention for Long Document Understanding
  - Pruning on downstream tasks: DeFormer, Are Sixteen Heads Really Better than One?
- Knowledge distillation (see the distillation-loss sketch after this list): MobileBERT, DistilBERT
- Knowledge distillation combined with adaptive inference: FastBERT
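A minimal sketch of the soft-target loss used in DistilBERT-style distillation: a temperature-scaled KL term toward the teacher's distribution, mixed with the ordinary hard-label cross-entropy; the mixing weight `alpha` and temperature `T` are illustrative defaults, not values taken from any particular paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on gold labels with a KL term that pushes the
    student's temperature-softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # rescale so the soft term keeps a comparable gradient magnitude
    return alpha * hard + (1 - alpha) * soft
```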
Multilinguality
- Pretraining on multilingual text (see the language-sampling sketch below)
- Multilingual Denoising Pre-training for Neural Machine Translation (mBART), Cross-lingual Language Model Pretraining (XLM)
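As one concrete detail of how multilingual pretraining data is typically assembled (used, for example, in XLM-style training), languages are often sampled with exponentially smoothed probabilities so that low-resource languages are not drowned out by high-resource ones; this sketch and its `alpha` default are illustrative rather than taken from a specific paper:

```python
def language_sampling_probs(sentence_counts, alpha=0.5):
    """Turn raw per-language corpus sizes into sampling probabilities
    proportional to (n_i / N) ** alpha; alpha < 1 upweights small languages."""
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# e.g. language_sampling_probs({"en": 1_000_000, "sw": 10_000}) gives Swahili
# a much larger share of batches than its raw ~1% of the data.
```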