With the broader use of language models (LMs) comes the need to estimate their ability to respond reliably to prompts (e.g.,
are generated responses likely to be correct?). Uncertainty quantification tools (notions of confidence and entropy, i.a.)
can be used to that end (e.g., to reject a response when the model is ‘uncertain’). For example, Kuhn et al. (semantic entropy;
2022b) regard semantic variation amongst sampled responses as evidence that the model ‘struggles’ with the prompt and that
the LM is likely to err. We argue that semantic variability need not imply error; this is especially intuitive in open-ended
settings, where prompts elicit multiple adequate but semantically distinct responses. Hence, we propose to annotate sampled
responses for their adequacy to the prompt (e.g., using a classifier) and estimate the Probability the model assigns to Adequate
Responses (PROBAR), which we then regard as an indicator of the model's reliability at the instance level. We evaluate PROBAR
as a measure of confidence in selective prediction with OPT models (in two QA datasets and in next-word prediction, for English)
and find PROBAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
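In symbols, a minimal sketch of the quantity we describe (the notation here is illustrative, assuming an adequacy annotator and responses sampled from the model): writing $p_\theta(y \mid x)$ for the model's probability of response $y$ to prompt $x$ and $A(x, y) \in \{0, 1\}$ for the adequacy annotation,
\[
\mathrm{PROBAR}(x) \;=\; \sum_{y} A(x, y)\, p_\theta(y \mid x) \;\approx\; \sum_{y \in \mathcal{S}(x)} A(x, y)\, p_\theta(y \mid x),
\]
where $\mathcal{S}(x)$ is a set of distinct responses sampled from the model for prompt $x$. Higher values indicate that more of the model's probability mass falls on adequate responses, which we take as a sign of reliability on that prompt.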