
A Survey on Long Text Classification

Organizing some notes from my internship at Tencent, and using the chance to re-familiarize myself with publishing posts to this blog. It has been a long time, and since I switched to a new computer I also need to migrate the blog's source code...

Handling long text has always been a hard problem in natural language processing. Now that BERT is so popular, almost every text classification task uses BERT as the pretrained model and fine-tunes it on the downstream task. Because BERT's input is limited to at most 512 tokens, long text is usually handled by truncation, e.g. keeping only the beginning, or the beginning plus the end. This discards key information from elsewhere in the text. This post surveys solutions from industry and academia for long text classification that go beyond simple truncation.

Solutions

Zhihu discussion: https://www.zhihu.com/question/327450789

  • How to Fine-Tune BERT for Text Classification? This paper proposes several ways of handling long text:

    Truncation methods: the key information of an article is at the beginning and end. The paper uses three different methods of truncating text to perform BERT fine-tuning (a sketch of all three follows the list):

    1. head-only: keep the first 510 tokens;
    2. tail-only: keep the last 510 tokens;
    3. head+tail: empirically select the first 128 and the last 382 tokens.
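
As a rough illustration, here is what these three strategies might look like with the HuggingFace transformers tokenizer; the `truncate_tokens` helper and the `bert-base-uncased` checkpoint are my own choices for illustration, not the paper's code. The budget is 510 because two slots are reserved for the [CLS] and [SEP] special tokens.

```python
# Illustrative sketch of the three truncation strategies from the paper;
# helper name and checkpoint are assumptions, not the authors' code.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def truncate_tokens(text, method="head+tail"):
    """Return at most 510 tokens (512 minus room for [CLS] and [SEP])."""
    tokens = tokenizer.tokenize(text)
    if len(tokens) <= 510:
        return tokens
    if method == "head-only":
        return tokens[:510]              # keep the first 510 tokens
    if method == "tail-only":
        return tokens[-510:]             # keep the last 510 tokens
    # head+tail: first 128 + last 382 tokens (128 + 382 = 510)
    return tokens[:128] + tokens[-382:]
```

The truncated token list still needs to be converted to ids and wrapped with [CLS]/[SEP] (e.g. via `tokenizer.convert_tokens_to_ids` and `tokenizer.build_inputs_with_special_tokens`) before being fed to the model.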

Hierarchical methods: the input text is first divided into k = L/510 fractions, which are fed into BERT to obtain the representation of each of the k text fractions. The representation of each fraction is the hidden state of the [CLS] token of the last layer. Mean pooling, max pooling, and self-attention are then used to combine the representations of all the fractions.
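A minimal sketch of the hierarchical approach with mean/max pooling, assuming recent versions of transformers and PyTorch; the `document_representation` function is my own illustration, and the self-attention variant is omitted for brevity:

```python
# Sketch of hierarchical pooling over 510-token fractions; function name
# and checkpoint are illustrative assumptions, not the paper's code.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

@torch.no_grad()
def document_representation(text, pooling="mean"):
    tokens = tokenizer.tokenize(text)
    # Split into k = ceil(L / 510) fractions of at most 510 tokens each
    chunks = [tokens[i:i + 510] for i in range(0, len(tokens), 510)]
    cls_states = []
    for chunk in chunks:
        ids = tokenizer.convert_tokens_to_ids(chunk)
        ids = tokenizer.build_inputs_with_special_tokens(ids)  # add [CLS]/[SEP]
        outputs = model(torch.tensor([ids]))
        # last-layer hidden state of [CLS] (position 0) for this fraction
        cls_states.append(outputs.last_hidden_state[0, 0])
    stacked = torch.stack(cls_states)      # shape: (k, hidden_size)
    if pooling == "mean":
        return stacked.mean(dim=0)
    if pooling == "max":
        return stacked.max(dim=0).values
    raise ValueError(f"unsupported pooling: {pooling}")
```

The pooled vector can then be passed to a classification head; the paper additionally combines the k fraction vectors with self-attention.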
