Much progress has been made in representing the meaning of linguistic units such as words, phrases, and sentences, thanks to powerful neural network architectures [1], [2]. These computational representations are high-dimensional vectors, learned such that units with similar meaning lie closer together in the vector space. As a result, they capture the meaning of concepts without being explicitly informed about what these concepts entail. These representations have improved performance on a variety of downstream NLP tasks. The question of to what extent they are similar to semantic representations in the human brain has drawn the attention of researchers trying to gain insight into human language processing.
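As a minimal illustration of this geometric property, the sketch below computes cosine similarities between toy word vectors. The embeddings here are made-up placeholders; real models learn vectors with hundreds of dimensions from large corpora.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimensional embeddings, for illustration only.
embeddings = {
    "dog": np.array([0.8, 0.1, 0.6, 0.2]),
    "cat": np.array([0.7, 0.2, 0.5, 0.3]),
    "car": np.array([0.1, 0.9, 0.2, 0.7]),
}

# Words with similar meaning should score higher than unrelated words.
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))  # high (~0.98)
print(cosine_similarity(embeddings["dog"], embeddings["car"]))  # low (~0.36)
```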
Two main approaches for comparing computational representations with human brain activation are encoding/decoding experiments and representational similarity analysis. Both methods assess whether computational models and the brain use similar organizational principles to process language, by testing whether their semantic representations exhibit similar patterns.
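To make the second approach concrete, here is a bare-bones sketch of representational similarity analysis, assuming one stimulus-by-feature matrix from the model and one stimulus-by-voxel matrix from fMRI recordings of the same stimuli. The data below are random placeholders, and the correlation distance and Spearman correlation used here are common but not the only choices; an encoding/decoding experiment would instead fit a regression model mapping one space onto the other.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_repr, brain_repr):
    """Correlate the dissimilarity structures of two representation spaces.

    model_repr: (n_stimuli, n_model_dims) array of model activations.
    brain_repr: (n_stimuli, n_voxels) array of fMRI responses to the
                same stimuli, in the same order.
    """
    # Representational dissimilarity matrix (RDM) for each space:
    # pairwise distances between all stimuli, in condensed form.
    model_rdm = pdist(model_repr, metric="correlation")
    brain_rdm = pdist(brain_repr, metric="correlation")
    # Rank correlation between the two RDMs: higher means the two
    # spaces organize the stimuli more similarly.
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho

# Toy example with random data; the expected score is close to zero.
rng = np.random.default_rng(0)
print(rsa_score(rng.normal(size=(20, 300)), rng.normal(size=(20, 5000))))
```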
Finding a correlation between the structures of computational and brain representations may contribute to linguistic, computational,
and cognitive science. Computational models can operationalize and test cognitive hypotheses about human language understanding.
Simultaneously, a better understanding of the human brain enables us to derive more cognitively plausible models [3]. A current issue is that there is no standardized way to evaluate the results of these analyses. We therefore compared different evaluation methods, using state-of-the-art deep learning models, and tested them on a number of fMRI datasets to allow for a robust comparison. We found that different methods could lead to vastly different results: for example, the way in which pairwise accuracy is defined could make a difference of 30% in accuracy (see the sketch below). Such inconsistent results could lead to misleading conclusions about the structural similarity between computational and brain representations. It is therefore important to make evaluation procedures more transparent.
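To illustrate how much the definition matters, the sketch below contrasts two pairwise accuracy variants found in the literature: a lenient one that accepts a pair when the summed similarity of the correct assignment beats the swapped one, and a strict one that requires each prediction individually to be closest to its own target. The data are random placeholders, and these are not necessarily the exact variants compared in our experiments; they simply show that the same predictions can receive markedly different scores.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pairwise_accuracy(predictions, targets, strict=False):
    """Fraction of stimulus pairs on which the predictions are correct.

    Lenient (strict=False): the summed similarity of the matched
    assignment must beat the mismatched (swapped) assignment.
    Strict (strict=True): each prediction must individually be more
    similar to its own target than to the other pair member's target.
    """
    pairs = list(combinations(range(len(targets)), 2))
    correct = 0
    for i, j in pairs:
        matched_i = cosine(predictions[i], targets[i])
        matched_j = cosine(predictions[j], targets[j])
        swapped_i = cosine(predictions[i], targets[j])
        swapped_j = cosine(predictions[j], targets[i])
        if strict:
            correct += (matched_i > swapped_i) and (matched_j > swapped_j)
        else:
            correct += (matched_i + matched_j) > (swapped_i + swapped_j)
    return correct / len(pairs)

# Simulated noisy decoder output: same predictions, two different scores.
rng = np.random.default_rng(1)
targets = rng.normal(size=(30, 50))
predictions = targets + rng.normal(scale=2.0, size=(30, 50))
print(pairwise_accuracy(predictions, targets, strict=False))
print(pairwise_accuracy(predictions, targets, strict=True))
```

The strict variant can never score higher than the lenient one, and with noisy predictions the gap can be large, which is exactly the kind of discrepancy that motivates reporting the definition alongside the number.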