The LAMBADA Dataset

Authors

S. Pezzelle

Date (dd-mm-yyyy)

2025

Title

The LAMBADA Dataset

Book title

Reference Module in Social Sciences

Publication Year

2025

Publisher

Elsevier

Document type

Chapter

Abstract

This article introduces the LAMBADA dataset, developed in 2016 to evaluate the ability of NLP models to understand texts longer than a single sentence. The dataset consists of passages from unpublished novels in which the final word has been masked. While human speakers can easily guess the missing word when given the broad context preceding it, the task becomes nearly impossible when only the target sentence is available. At the time of its release, language models performed poorly on LAMBADA, revealing significant gaps in their ability to leverage broader context for accurate word prediction. Since its introduction, the landscape of NLP has changed dramatically with the advent of the Transformer architecture, which powered a new generation of models trained on next-word prediction as part of the language modeling objective. These models have demonstrated substantial improvements in handling larger contexts, and LAMBADA has become an essential benchmark for measuring their quality and progress. In this article, I provide a detailed overview of the dataset, its design, and its role in shaping the development of state-of-the-art language models over the past eight years.

Permalink

https://hdl.handle.net/11245.1/d7dbfbda-0ec7-4948-a488-9b59e571ab0e