Systematicity, a core aspect of compositionality, is a desirable property in ML models, as it enables strong generalization
to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models
and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and
Pylyshyn (1988). However, while Fodor and Pylyshyn argue for the systematicity of representations, existing benchmarks and models primarily
focus on the systematicity of behaviour. We argue that this distinction is crucial. Furthermore, building on Hadley’s (1994)
taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks
in the literature across language and vision. Finally, we highlight ways of assessing the systematicity of representations
in ML models, as practiced in the field of mechanistic interpretability.