Fast Transformer Decoding: One Write-Head is All You Need
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
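The sketch below illustrates the core change from the abstract in plain NumPy (a hypothetical illustration, not the paper's own TensorFlow code): the queries remain per-head, while a single key tensor and a single value tensor are shared by every head, so the tensors that must be reloaded at each incremental decoding step shrink by a factor of the head count.

```python
# Minimal sketch of multi-query attention (illustrative, not the paper's code).
# In ordinary multi-head attention, keys and values have one set per head:
#   K, V: [batch, heads, seq, d_k]
# In multi-query attention, all heads share a single K and a single V:
#   K, V: [batch, seq, d_k]
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(q, k, v):
    """q: [batch, heads, seq_q, d_k]; k, v: [batch, seq_kv, d_k], shared across heads."""
    # Attention logits: each head's queries against the single shared key tensor.
    logits = np.einsum("bhqd,bkd->bhqk", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    # Weighted sum over the single shared value tensor.
    return np.einsum("bhqk,bkd->bhqd", weights, v)

# Toy shapes: batch 2, 8 heads, 5 query positions, 7 key/value positions, d_k = 64.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8, 5, 64))
k = rng.standard_normal((2, 7, 64))
v = rng.standard_normal((2, 7, 64))
print(multi_query_attention(q, k, v).shape)  # (2, 8, 5, 64)
```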
Author

Noam Shazeer
Tweets
arxiv_pop: Submitted 2019/11/05, ranked #1 in NE (Neural and Evolutionary Computing): Fast Transformer Decoding: One Write-Head is All You Need https://t.co/XUzvZYggDu 13 Tweets 73 Retweets 440 Favorites
AkiraTOSEI: https://t.co/BVrwPsI7oe They propose Multi-query Attention, which shares the keys and values across heads in the Transformer's self-attention. Since loading the keys and values takes a large share of the time, sharing them allows faster decoding. Achieves a speedup of more than 10x without losing accuracy. https://t.co/TCV3k9Todk
AkiraTOSEI: https://t.co/BVrwPsI7oe They propose Multi-query Attention, which is the same as self-attention but where every head shares the keys and values. Since the reload time of those is large, they can speed up by sharing. Achieves a speed increase of 10 times or more without sacrificing accuracy https://t.co/ELIVVsVKD0
Tdys13: I want to read this today: Fast Transformer Decoding: One Write-Head is All You Need https://t.co/LWiZardZMk
arxiv_cs_LG: Fast Transformer Decoding: One Write-Head is All You Need. Noam Shazeer https://t.co/CQTYMEahpO
hillbig: Multi-head attention is slow for incremental inference because of the high memory-bandwidth cost for loading large keys and values at each position. Multi-query attention is fast with minor degradation by sharing keys and values between different heads. https://t.co/8LmfYlTvNz
hillbig: Transformers generally use multiple heads to increase expressive power, outputting keys and values for every head at each position; these tensors are large, so decoding becomes memory-bandwidth bound and incremental inference is slow. Sharing the keys and values across heads while keeping only the queries per-head loses a little accuracy but is much faster https://t.co/8LmfYlTvNz
arxiv_cscl: Fast Transformer Decoding: One Write-Head is All You Need https://t.co/HiTgwIhrOq
Miles_Brundage: Naturally, this is a single-authored paper: "Fast Transformer Decoding: One Write-Head is All You Need," Noam Shazeer: https://t.co/dYwkV3LGFx
Memoirs: Fast Transformer Decoding: One Write-Head is All You Need. https://t.co/Lp0rf7qqgC
arxivml: "Fast Transformer Decoding: One Write-Head is All You Need", Noam Shazeer https://t.co/hk634Y2RGT
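As a rough back-of-the-envelope companion to the memory-bandwidth argument in the abstract and the tweets above, the sketch below compares the per-layer key/value cache that an incremental decoder must reload at every step under multi-head versus multi-query attention; the dimensions are hypothetical and chosen only for illustration, not taken from the paper.

```python
# Hypothetical dimensions for illustration (not the paper's configuration).
seq_len, n_heads, d_k, bytes_per_elem = 1024, 8, 64, 4

# Multi-head attention: each head caches its own keys and values.
multi_head_cache = 2 * seq_len * n_heads * d_k * bytes_per_elem   # K and V
# Multi-query attention: one key tensor and one value tensor shared by all heads.
multi_query_cache = 2 * seq_len * d_k * bytes_per_elem

print(multi_head_cache // 1024, "KiB vs.", multi_query_cache // 1024, "KiB")
# -> 4096 KiB vs. 512 KiB: the cached tensors streamed from memory at each
#    decoding step shrink by a factor of n_heads.
```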