Organizing some notes from my internship at Tencent, and taking the chance to get reacquainted with how to upload posts to this blog. It has been a long time, and since I switched to a new computer I also need to migrate the blog's source code...

Handling long text has always been a hard problem in natural language processing. Now that BERT is so popular, almost all text classification tasks use BERT as the pretrained model and fine-tune it on the downstream task. Since BERT's input is limited to at most 512 tokens, long texts are mostly handled by truncation: for example, keeping only the beginning, or the beginning plus the end. But this discards a lot of key information in the text. This post surveys solutions from industry and academia for long-text classification that go beyond simple truncation.
Solutions
Zhihu discussion: https://www.zhihu.com/question/327450789
HIERARCHICAL TRANSFORMERS FOR LONG DOCUMENT CLASSIFICATION
Two extensions - RoBERT and ToBERT - to the BERT model, which enable its application to classification of long texts by segmenting the input and adding another layer on top of the segment representations.
Experiments on Chinese data: https://blog.csdn.net/valleria/article/details/105311340
How to Fine-Tune BERT for Text Classification? This paper presents several ways to handle long texts:
Truncation methods: the key information of an article is usually at its beginning and end, so three different ways of truncating the text are used for BERT fine-tuning:
- head-only: keep the first 510 tokens;
- tail-only: keep the last 510 tokens;
- head+tail: empirically select the first 128 and the last 382 tokens.
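The head+tail strategy above can be sketched as a simple token-list operation (a minimal sketch; the function name and the placeholder "tok" tokens are my own, and 128 + 382 = 510 leaves room for the [CLS] and [SEP] special tokens in a 512-token input):

```python
def truncate_head_tail(tokens, head=128, tail=382):
    """Keep the first `head` and last `tail` tokens when the input is too long.

    `tokens` is a list of WordPiece tokens, before special tokens are added.
    If the text already fits, it is returned unchanged.
    """
    if len(tokens) <= head + tail:
        return tokens
    return tokens[:head] + tokens[-tail:]

long_doc = [f"tok{i}" for i in range(1000)]
truncated = truncate_head_tail(long_doc)
# 510 tokens: tok0..tok127 followed by tok618..tok999
```

The same function with `head=510, tail=0` (or `head=0, tail=510`) gives the head-only and tail-only variants.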
Hierarchical methods: the input text is first divided into k = L/510 fractions, each of which is fed into BERT to obtain a representation of that fraction. The representation of each fraction is the last-layer hidden state of its [CLS] token. Mean pooling, max pooling, and self-attention are then used to combine the representations of all the fractions.
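The mean- and max-pooling variants of this aggregation step can be sketched as follows (a minimal sketch in plain Python; in practice each inner list would be the last-layer [CLS] hidden state that BERT produces for one fraction):

```python
def mean_pool(cls_vectors):
    """Average the per-fraction [CLS] vectors into one document vector.

    `cls_vectors` is a list of equal-length lists of floats, one per fraction.
    """
    n = len(cls_vectors)
    return [sum(dims) / n for dims in zip(*cls_vectors)]

def max_pool(cls_vectors):
    """Element-wise maximum over the per-fraction [CLS] vectors."""
    return [max(dims) for dims in zip(*cls_vectors)]

segments = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
mean_pool(segments)  # [2.0, 1.0]
max_pool(segments)   # [4.0, 3.0]
```

The self-attention variant replaces the fixed pooling with a small learned attention layer over the same stack of [CLS] vectors.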
Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
PARADE: Passage Representation Aggregation for Document Reranking
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A common current solution to this problem is the Sliding Window approach, seen mainly in reading-comprehension tasks (such as Stanford's SQuAD). Sliding Window splits the document into several overlapping segments, feeds each segment into BERT as an independent document, and finally aggregates the results obtained from these independent documents.
Sliding Window can also be restricted to training only: at test time there is no back-propagation and no need for a large batch size, so there is always a way to fit the long text into GPU memory (e.g. torch.no_grad, batch_size=1). For a concrete implementation, see the original BERT's run_squad.py: https://github.com/google-research/bert/blob/master/run_squad.py
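The overlapping segmentation itself can be sketched like this (a minimal sketch; `run_squad.py` implements the same idea via its `doc_stride` flag, while the function name and parameter values here are my own):

```python
def sliding_window(tokens, max_len=510, stride=400):
    """Split `tokens` into overlapping windows of at most `max_len` tokens.

    Each window starts `stride` tokens after the previous one, so
    consecutive windows overlap by `max_len - stride` tokens. Every
    window is then fed to BERT as an independent document.
    """
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

chunks = sliding_window(list(range(1000)))
# 3 windows: tokens [0:510], [400:910], [800:1000]
```

A smaller stride gives more overlap (and more segments), which costs compute but reduces the chance of splitting an answer span across window boundaries.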