Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning, if performed at all, typically takes place in English, these models are used by speakers of many different languages. There is
existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on
demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ
as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end,
we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset
extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement
MBBQ with a parallel control dataset to measure performance on the question-answering task independently of bias. Our
results based on several open-source and proprietary LLMs confirm that models exhibit more bias in some non-English languages than in
English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour
for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual
settings. The dataset and code are available at https://github.com/Veranep/MBBQ.