Pretrained language models have achieved impressive results on a wide range of natural language understanding tasks. However,
training these models is computationally expensive and requires huge amounts of data. Thus, it would be desirable to automatically
identify which groups of training examples are more or less important. Here, we investigate whether we can leverage a commonly
overlooked source of information, Wikipedia categories as listed in DBpedia, to identify useful or harmful data points during pretraining.
We define an experimental setup in which we analyze correlations between language model perplexity on specific category clusters and
downstream NLP task performance during pretraining. Our experiments show that Wikipedia categories are not a good indicator
of the importance of specific sentences for pretraining.
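As a rough illustration of the kind of analysis this implies (not the paper's actual pipeline), the sketch below correlates per-category perplexity with a downstream score across pretraining checkpoints; the checkpoints, category names, and all numeric values are hypothetical.

```python
# Hypothetical sketch: correlate per-category perplexity with downstream
# performance across pretraining checkpoints. All numbers are made up.
import numpy as np
from scipy.stats import pearsonr

# Rows = pretraining checkpoints, columns = Wikipedia category clusters.
# Each entry is the LM perplexity on held-out sentences from that cluster.
perplexity = np.array([
    [48.2, 61.5, 39.9],   # checkpoint after 100k steps
    [35.7, 50.1, 31.2],   # checkpoint after 200k steps
    [29.4, 44.8, 27.5],   # checkpoint after 300k steps
    [26.1, 41.0, 25.3],   # checkpoint after 400k steps
])
categories = ["People", "Places", "Organisations"]

# Downstream task score (e.g. accuracy) measured at the same checkpoints.
downstream_score = np.array([0.71, 0.78, 0.82, 0.84])

# A strong (negative) correlation would suggest that improving on a cluster
# tracks downstream gains; a weak one suggests the category tells us little
# about the importance of its sentences for pretraining.
for name, ppl in zip(categories, perplexity.T):
    r, p = pearsonr(ppl, downstream_score)
    print(f"{name}: Pearson r = {r:.2f} (p = {p:.2f})")
```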