With the rapid advancement of LLMs, many attention mechanisms beyond the Multi-Head Attention (MHA) of the original Transformer have been proposed, such as Multi-Query Attention (MQA) used in Falcon, Grouped-Query Attention (GQA) used in Llama, and Sliding-Window Attention (SWA) used in Mistral. Both MQA and GQA aim to save GPU memory (i.e. reduce the size of the Key & Value projection matrices in attention) and speed up attention computation (i.e. shrink the KV cache so data is read faster and larger batch sizes are supported) without degrading model quality too much. Sliding-Window Attention (SWA) limits the attention span of each token to a fixed-size window around it, which reduces computational complexity and makes the model more efficient.
This blog will implement all these different attention mechanisms from scratch using PyTorch.
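As a preview, here is a minimal, self-contained sketch of grouped-query attention (illustrative code, not the blog's exact implementation; masking is omitted for brevity). Setting `num_kv_heads` equal to `num_heads` recovers standard MHA, and setting it to 1 gives MQA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA sketch: queries keep `num_heads` heads, while keys and values
    use only `num_kv_heads` heads that are shared across groups of query heads."""

    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.head_dim)
        # K/V projections are smaller: this is where the memory saving comes from.
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(num_heads * self.head_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each K/V head so every group of query heads attends to its shared K/V head.
        group_size = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        attn = F.softmax((q @ k.transpose(-2, -1)) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out)
```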
MoE (Mixture-of-Experts) Model
This image shows the basic structure of MoE. (source)
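To complement the figure, here is a minimal sketch of a top-k routed MoE feed-forward layer (illustrative only; the expert and router shapes are assumptions, not taken from the referenced image).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sketch of a top-k routed Mixture-of-Experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        tokens = x.view(-1, D)                           # flatten to (B*T, D)
        scores = self.gate(tokens)                       # (B*T, num_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                 # normalize weights of chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            weight = top_w[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weight * expert(tokens[token_ids])
        return out.view(B, T, D)
```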
KV Cache in Transformer Inference
The KV cache is an important technique used in transformer-decoder based models during inference; it saves a large amount of computation and makes inference faster.
A transformer-decoder based model is auto-regressive: it generates new tokens one at a time, one per step. Every time we get a new token, we append it to the model's input tokens to generate a sequence of output tokens, whose last token is the next new token. We repeat this generation process until we produce the end-of-text token or hit the maximum sequence length.
At every inference step, we are only interested in the last token of the model's output, which is the newly generated token. However, the computations for the old tokens are repeated again and again at each step because of the sequence-to-sequence nature of transformer-decoder models. The model still needs access to all previous tokens during masked self-attention in order to decide which token to output, since they constitute its context.
Therefore, the KV cache was introduced: by storing the Key and Value projections of previous tokens, the model avoids this repetitive computation in the attention calculation at inference time, trading some extra GPU memory for a large reduction in compute.
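As an illustration of the idea (a simplified single-head sketch, not the full multi-head implementation; the causal mask for the prefill step is omitted for brevity), the attention layer below accepts the cached keys and values of previous tokens and only projects the newly generated token:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CachedSelfAttention(nn.Module):
    """Minimal sketch of single-head self-attention with a KV cache."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        # x: (batch, new_tokens, d_model); during decoding new_tokens is usually 1.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if kv_cache is not None:
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=1)   # reuse cached keys/values of old tokens
            v = torch.cat([past_v, v], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_model ** 0.5, dim=-1)
        return attn @ v, (k, v)                 # return output and the updated cache

# Toy decoding loop: only the newly generated token is fed through the layer,
# while keys/values for all previous tokens come from the cache.
layer = CachedSelfAttention(d_model=16)
prompt = torch.randn(1, 5, 16)                  # prompt embeddings (toy example)
out, cache = layer(prompt)                      # prefill: build cache for the prompt
new_token = torch.randn(1, 1, 16)               # embedding of the next generated token
out, cache = layer(new_token, cache)            # decode step: attends over 6 positions
```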
Coding Transformer model from scratch
Here is a simple PyTorch implementation of the Transformer model.
The implementation has the following components:
- Building Blocks: Multi-Head Attention, Position-Wise Feed-Forward Networks, Positional Encoding (see the positional-encoding sketch after this list)
- Building Encoder and Decoder layers
- Combining Encoder and Decoder layers to create the complete Transformer model
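As an example of one building block, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper (illustrative code, not necessarily identical to the blog's implementation):

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding from the original Transformer paper."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)          # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))     # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions.
        return x + self.pe[:, : x.size(1)]
```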
Coding Llama2 model from scratch
This blog will implement the decoder-only model Llama2 from scratch with PyTorch to get a better understanding of the model structure and some of the tricks used in the model.
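One such trick in Llama-style models is RMSNorm, which normalizes by the root mean square of the features instead of centering them as LayerNorm does. A minimal sketch (illustrative, not the blog's exact code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch: scale features by the inverse root-mean-square,
    with a learnable per-dimension gain and no mean subtraction or bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))     # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```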
Multi-Head Self Attention Implementation
Implementation of ML & DL & NLP algorithms (WIP)
This blog aims to implement some important algorithms related to Machine Learning, Deep Learning, Natural Language Processing, and Large Language Models from scratch, which can help people gain a deeper understanding of how these algorithms work in real code.
A Look Back at the 2021 Autumn Recruitment Season
Only recently have I found some spare time to look back on my 2021 autumn recruitment experience in blog form, while the memories are still reasonably fresh, even though more than three months have passed and I have not had an interview in three months. For most of the past year I have been interning at eBay, and in December my thesis proposal was also approved, so what remains is to grind through graduation 🎓. What I did not expect is that even the thesis proposal took four months, which really dragged on far too long. So if anyone asks me again how it is to come to Aachen, I would advise against it if they care about cost-effectiveness or mainly want to find a job... Enough small talk, let's get to the point.
A Survey of Long-Text Classification
This post organizes some notes from my earlier internship at Tencent, and also serves to re-familiarize myself with how to publish articles on this blog; it has been a long time since I last did so, and since I switched to a new computer I also need to migrate the blog's source code...
Handling long texts has always been a relatively hard problem in natural language processing. With BERT being so popular today, almost all text classification tasks use BERT as the pre-trained model and then fine-tune it on the downstream task. Because BERT's input is limited to a fixed length of 512 tokens, most approaches handle long texts by truncation, for example keeping only the beginning, or keeping the beginning and the end. However, this loses a lot of key information in the text. This post surveys current solutions from industry and academia for long-text classification that do not rely on simple truncation.
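For reference, the simple truncation baseline mentioned above can be sketched as follows (an illustrative helper; special tokens such as [CLS]/[SEP] are omitted, and `head_len` is an assumed split rather than a value from the surveyed papers):

```python
def head_tail_truncate(token_ids, max_len=512, head_len=128):
    """Keep the first `head_len` tokens and fill the remaining budget
    with tokens from the end of the document (head + tail truncation)."""
    if len(token_ids) <= max_len:
        return token_ids
    tail_len = max_len - head_len
    return token_ids[:head_len] + token_ids[-tail_len:]
```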