Authors
S. Pezzelle
Date (dd-mm-yyyy)
2025
Title
The LAMBADA Dataset
Book title
Reference Module in Social Sciences
Publication Year
2025
Publisher
Elsevier
Document type
Chapter
Abstract
This article introduces the LAMBADA dataset, developed in 2016 to evaluate the ability of computational NLP models to understand texts longer than a single sentence. The dataset consists of passages from unpublished novels where the final word has been masked. While human speakers can easily guess the missing word when provided with the broad context preceding it, this task becomes nearly impossible when only the target sentence is available. At the time of its release, language models performed poorly on LAMBADA, revealing significant gaps in their ability to leverage broader contexts for accurate word prediction. Since its introduction, the landscape of NLP has changed dramatically with the advent of the Transformer architecture that powered a new generation of models trained on next-word prediction as part of the language modeling objective. These models have demonstrated substantial improvements in handling larger contextual information, and LAMBADA has become an essential benchmark for measuring their quality and progress. In this article, I provide a detailed overview of the dataset, its design, and its role in shaping the development of current state-of-the-art language models over the past eight years and continuing to the present day.
Permalink
https://hdl.handle.net/11245.1/d7dbfbda-0ec7-4948-a488-9b59e571ab0e