The latest generation of LLMs can be prompted to achieve impressive zero-shot or few-shot performance on many NLP tasks. However,
since performance is highly sensitive to the choice of prompts, considerable effort has been devoted to crowd-sourcing prompts
or designing methods for prompt optimisation. Yet, we still lack a systematic understanding of how linguistic properties of
prompts correlate with task performance. In this work, we investigate how LLMs of different sizes, pre-trained and instruction-tuned,
perform on prompts that are semantically equivalent but vary in linguistic structure. We examine both grammatical properties,
such as mood, tense, aspect and modality, and lexico-semantic variation through the use of synonyms. Our findings contradict
the common assumption that LLMs achieve optimal performance on lower-perplexity prompts that reflect language use in pre-training
or instruction-tuning data. Prompts transfer poorly across datasets and models, and performance cannot generally be explained
by perplexity, word frequency, ambiguity or prompt length. Based on our results, we propose a more robust
and comprehensive evaluation standard for prompting research.