For language models to generalize correctly to novel expressions, it is critical that they exploit compositional meanings
when this is justified. Even if we don’t know what a “pelp” is, we can use our knowledge of numbers to understand that “ten
pelps” denotes more pelps than “two pelps”. Static word embeddings such as Word2vec made strong, indeed excessive, claims about
compositionality. State-of-the-art generative transformer models and graph models, however, go too far in the other direction,
providing no real limits on shifts in meaning due to context. To quantify additive compositionality, we formalize a two-step,
generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical
correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations
and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture
failure cases where linear composition breaks down. We evaluate sentence, knowledge-graph, and word embeddings, tracking
compositionality across all layers and training stages. Stronger compositional signals are observed in later training
stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code
will be publicly available on GitHub upon acceptance.
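A minimal sketch of the two-step evaluation, in Python with NumPy and scikit-learn, is given below. The synthetic binary attributes, the noisy linear ground-truth embeddings, and the least-squares fit of additive attribute vectors are illustrative assumptions, not the paper's exact pipeline; the sketch only shows how CCA-based linearity and the reconstruction metrics (L2 loss, cosine similarity, retrieval accuracy) fit together.

```python
"""Sketch of the two-step compositionality evaluation (illustrative assumptions only)."""
from itertools import product

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Toy setup: each entity is a unique combination of k binary attributes,
# embedded in d dimensions by a noisy linear map (stand-in for a real model).
k, d = 8, 64
A = np.array(list(product([0.0, 1.0], repeat=k))[1:])   # all combos except the empty one
E = A @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(len(A), d))

# Step (i): linearity between attributes and embeddings via CCA,
# summarized as the mean correlation of the canonical variate pairs.
A_c, E_c = CCA(n_components=k).fit_transform(A, E)
canon_corr = np.mean([np.corrcoef(A_c[:, i], E_c[:, i])[0, 1] for i in range(k)])
print(f"mean canonical correlation: {canon_corr:.3f}")

# Step (ii): additive generalization. Hold out ~20% of attribute combinations,
# fit additive attribute vectors on the rest, and reconstruct the unseen ones.
test = rng.random(len(A)) < 0.2
W_hat, *_ = np.linalg.lstsq(A[~test], E[~test], rcond=None)
E_pred = A[test] @ W_hat

# Reconstruction metrics: L2 loss, cosine similarity, retrieval accuracy.
l2 = np.linalg.norm(E_pred - E[test], axis=1).mean()
cos = np.mean(np.sum(E_pred * E[test], axis=1)
              / (np.linalg.norm(E_pred, axis=1) * np.linalg.norm(E[test], axis=1)))
# Retrieval: is the true embedding the nearest neighbour (by cosine) of its reconstruction?
sims = (E_pred / np.linalg.norm(E_pred, axis=1, keepdims=True)) \
       @ (E / np.linalg.norm(E, axis=1, keepdims=True)).T
retrieval_acc = np.mean(sims.argmax(axis=1) == np.flatnonzero(test))
print(f"L2: {l2:.3f}  cosine: {cos:.3f}  retrieval@1: {retrieval_acc:.2f}")
```

Because the toy embeddings are generated by an (almost) linear map of the attributes, the sketch should report high canonical correlations and near-perfect reconstruction; embeddings whose composition is less linear would show lower scores on the same metrics.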