SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling
With language modeling becoming the standard base task for unsupervised representation learning in Natural Language Processing, it is important to develop new architectures and techniques for faster and better training of language models. However, due to a statistical property of text corpora -- the larger the corpus, the more often the average word appears in it -- datasets of different sizes have very different properties. Architectures that perform well on small datasets might not perform well on larger ones. For example, LSTM models perform well on WikiText-2 but poorly on WikiText-103, while Transformer models perform well on WikiText-103 but not on WikiText-2. For setups like architecture search, this is a challenge: running a search on the full dataset is prohibitively costly, yet experiments on smaller datasets are not indicative of performance at scale. In this paper, we introduce SimpleBooks, a small dataset whose average word frequency is as high as that of much larger ones. Created from the 1,573 Gutenberg books with the highest ratio of word-level book length to vocabulary size, SimpleBooks contains 92M word-level tokens, on par with WikiText-103 (103M tokens), but has a vocabulary of only 98K, roughly a third of WikiText-103's. SimpleBooks can be downloaded from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip.
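As a rough illustration of the selection criterion, here is a minimal Python sketch, assuming one plain-text file per book and naive whitespace tokenization; the directory name and helper names are hypothetical, since the paper's actual preprocessing pipeline is not reproduced here.

import os

def length_to_vocab_ratio(path):
    # Token count divided by vocabulary size for one plain-text book.
    # Whitespace tokenization is an assumption, not the paper's pipeline.
    with open(path, encoding="utf-8") as f:
        tokens = f.read().lower().split()
    return len(tokens) / max(len(set(tokens)), 1)

def select_books(book_dir, n_books=1573):
    # Keep the n_books files with the highest length-to-vocabulary ratio,
    # mirroring the abstract's description of how SimpleBooks was built.
    paths = [os.path.join(book_dir, name) for name in os.listdir(book_dir)]
    return sorted(paths, key=length_to_vocab_ratio, reverse=True)[:n_books]

selected = select_books("gutenberg_books/")  # hypothetical directory

The same ratio computed corpus-wide is the average word frequency the abstract compares: roughly 92M tokens / 98K vocabulary ≈ 940 occurrences per word for SimpleBooks, versus roughly 103M / 268K ≈ 385 for WikiText-103.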
Author

Huyen Nguyen
Github

LSTM and QRNN Language Model Toolkit for PyTorch
Language: Python
Stargazers: 1,561
Forks: 387
Open Issues: 58
Network: 387
Subscribers: 66
Tweets
arxiv_pop: Submitted 2019/11/27, ranked #1 in CL (Computation and Language). SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling https://t.co/ogNQVED82O 5 Tweets 24 Retweets 275 Favorites
roadrunning01: SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling pdf: https://t.co/Lq8SMebl2w abs: https://t.co/C3Pa67S8Tg link: https://t.co/EopaNHtz9J https://t.co/LG2bRFP5ss
arxiv_cscl: SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling https://t.co/qZRwNV5Rjd
chipro: SimpleBooks is a longterm dependency dataset that is 90% the size of WikiText-103 but has 1/3 vocab and 1/4 OOV. I created it last year to test, benchmark, & do tutorials for word-level language models but didn't publish it bc small datasets get 0 love 😅 https://t.co/3TNA2xoz5Z https://t.co/8zCdf1ovdd