Excited to share our work on Longformer, a scalable transformer model for long-document NLP tasks that doesn't need chunking or truncation to fit the 512-token limit.
Work with @mattthemathman, @armancohan
Code and pretrained model: http://github.com/allenai/longformer
Paper: http://arxiv.org/abs/2004.05150 
(1/3)
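For a quick sense of what skipping the chunking/truncation step looks like in practice, here is a rough usage sketch. It assumes the Hugging Face transformers integration of Longformer and the allenai/longformer-base-4096 checkpoint rather than the code in our repo, so treat the exact calls as illustrative:

```python
# Usage sketch: encode a long document in one pass (assumes the Hugging Face
# `transformers` Longformer classes and the allenai/longformer-base-4096
# checkpoint; the original release lives in the repo linked above).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# A document far longer than BERT/RoBERTa's 512-token limit.
long_document = " ".join(["word"] * 3000)
inputs = tokenizer(long_document, return_tensors="pt")

# Task-specific "global" attention, e.g. on the <s> token for classification.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

No sliding 512-token chunks, no stitching predictions back together afterwards.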
We replace the standard self-attention with one that scales linearly with sequence length and can flexibly adapt to downstream tasks. We continue pretraining from the RoBERTa checkpoint and evaluate on QA, coreference resolution, and classification. The pretrained model supports sequences up to 4,096 tokens.
(2/3)
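To make the linear scaling concrete, here is a plain PyTorch sketch of sliding-window self-attention. It leaves out the dilation, the task-specific global attention, and the custom CUDA kernel from the paper, and pads the sequence edges with zeros for simplicity:

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Each token attends only to `window` tokens on each side, so compute
    and memory grow linearly with sequence length, not quadratically.
    q, k, v: (batch, seq_len, dim). Edge padding is left unmasked here
    for brevity; a full implementation would mask it out."""
    b, n, d = q.shape
    w = window
    # Pad keys/values so every position has a full (2w + 1)-token neighbourhood.
    k_pad = F.pad(k, (0, 0, w, w))                        # (b, n + 2w, d)
    v_pad = F.pad(v, (0, 0, w, w))
    # Gather each position's local window: (b, n, 2w + 1, d)
    k_win = k_pad.unfold(1, 2 * w + 1, 1).transpose(-1, -2)
    v_win = v_pad.unfold(1, 2 * w + 1, 1).transpose(-1, -2)
    # Attention scores only over the local window: (b, n, 2w + 1)
    scores = torch.einsum("bnd,bnkd->bnk", q, k_win) / d ** 0.5
    probs = scores.softmax(dim=-1)
    return torch.einsum("bnk,bnkd->bnd", probs, v_win)

x = torch.randn(1, 4096, 64)
out = sliding_window_attention(x, x, x, window=64)
print(out.shape)  # torch.Size([1, 4096, 64])
```

This naive version still materializes the (2w + 1)-wide bands; avoiding that is what the custom kernel in the next tweet is for.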
The small model achieves SOTA results on enwik8 and text8, and the large model gets close with half the parameters. Longformer's self-attention uses an efficient CUDA kernel that minimizes memory usage (char-LM large model: 23K tokens at training and 32K tokens at evaluation).
(3/3)
About the CUDA kernel: it is implemented using TVM (which compiles Python code into CUDA), making GPU programming more accessible. I haven't seen TVM used in NLP before, but I can imagine it being useful in sparse-model research where you need sparsity patterns not available in PyTorch.
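If you are curious what that looks like, here is a toy vector-add in TVM's tensor-expression API (assuming a TVM version with the tvm.te namespace); it is not the Longformer banded-matmul kernel, just the general pattern of describing a computation in Python and compiling it to CUDA:

```python
import tvm
from tvm import te

# Describe the computation symbolically in Python.
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule it onto GPU threads and compile to CUDA.
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))
fadd = tvm.build(s, [A, B, C], target="cuda", name="vector_add")

# Inspect the generated CUDA source.
print(fadd.imported_modules[0].get_source())
```

Swapping the lambda for a banded matrix multiplication (and re-tuning the schedule) is, roughly, how a sparse attention pattern becomes a GPU kernel without hand-written CUDA.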