Compositional generalization is a key facet of human cognition but is lacking in current AI tools such as vision-language models.
Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative
results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We
interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits
to learn these representations on an image captioning task requiring compositional generalization. We use two image encoding
techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from
the vision-language model CLIP. We achieve good proof-of-concept results with noisy multi-hot encodings. Performance on CLIP image vectors was more mixed, but still exceeded that of classical compositional models.
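To make the named encodings concrete, the following is a minimal, hypothetical sketch in PennyLane, not the implementation evaluated here: the qubit count, the `StronglyEntanglingLayers` ansatz, the Pauli-Z readout, and the toy inputs are all assumed choices, and real CLIP vectors would first need dimensionality reduction (or amplitude encoding) to fit a small register.

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4  # assumed toy register size
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def mhe_circuit(bits, weights):
    # Multi-hot encoding: a binary image vector is loaded as a
    # computational basis state.
    qml.BasisEmbedding(bits, wires=range(n_qubits))
    # Trainable entangling layers form the variational ansatz.
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

@qml.qnode(dev)
def angle_circuit(features, weights):
    # Angle encoding: each continuous feature sets one single-qubit
    # rotation angle. (Amplitude encoding would instead use
    # qml.AmplitudeEmbedding with normalize=True.)
    qml.AngleEmbedding(features, wires=range(n_qubits), rotation="Y")
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

# Hypothetical toy inputs: a 4-bit "multi-hot" vector and a
# 4-dimensional stand-in for a reduced CLIP image vector.
weights = np.random.uniform(0, 2 * np.pi, size=(2, n_qubits, 3), requires_grad=True)
print(mhe_circuit([1, 0, 1, 0], weights))
print(angle_circuit(np.random.uniform(-1, 1, n_qubits), weights))
```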