Extreme classification tasks are multi-label tasks with an extremely large number of labels (tags). These tasks are hard because
the label space is usually (i) very large, e.g. thousands or millions of labels, (ii) very sparse, i.e. very few labels apply
to each input document, and (iii) highly correlated, meaning that the presence of one label changes the likelihood of predicting
all other labels. In this work, we propose a self-attention-based variational encoder model to extract the label-label and
label-feature dependencies jointly and to predict labels for a given input. In more detail, we propose a non-autoregressive
latent variable model and compare it to a strong autoregressive baseline that predicts a label based on all previously generated
labels. Our model can therefore be used to predict all labels in parallel while still including both label-label and label-feature
dependencies through latent variables, and compares favourably to the autoregressive baseline. We apply our models to four
standard extreme classification natural language datasets and one news video dataset for automated label detection from
a lexicon of semantic concepts. Experimental results show that while autoregressive models, which predict labels in a chain
following a given label order, perform well when the label set is small or when only the top-ranked labels must be predicted,
our non-autoregressive model surpasses them by around 2% to 6% when more labels must be predicted or when the dataset
has a larger label space.
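The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of the general idea: a latent variable z is inferred from document features and a decoder then scores all labels in parallel, so label-label dependencies are mediated by z rather than by an autoregressive chain. All names here (NARLabelPredictor, latent_dim, elbo_loss, the feature dimension) are hypothetical, and the self-attention document encoder that would produce the feature vector x is assumed to exist upstream (random features stand in for it below).

```python
# Illustrative sketch only, not the authors' implementation.
import torch
import torch.nn as nn

class NARLabelPredictor(nn.Module):
    def __init__(self, feat_dim: int, num_labels: int, latent_dim: int = 64):
        super().__init__()
        # Amortized inference network: q(z | x) as a diagonal Gaussian.
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # Decoder scores all labels at once from features and latent code,
        # so label-label dependencies flow through z instead of a
        # left-to-right chain over previously generated labels.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, x: torch.Tensor):
        mu, logvar = self.to_mu(x), self.to_logvar(x)
        # Reparameterization trick: sample z once, decode all labels in parallel.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        logits = self.decoder(torch.cat([x, z], dim=-1))
        return logits, mu, logvar

def elbo_loss(logits, targets, mu, logvar):
    # Reconstruction term: binary cross-entropy over all labels.
    bce = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="sum")
    # KL divergence of q(z|x) from a standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (bce + kl) / logits.size(0)

# Toy usage: 8 documents with 100-dim features, 1000 candidate labels.
model = NARLabelPredictor(feat_dim=100, num_labels=1000)
x = torch.randn(8, 100)
y = (torch.rand(8, 1000) < 0.01).float()  # sparse multi-hot targets
logits, mu, logvar = model(x)
loss = elbo_loss(logits, y, mu, logvar)
loss.backward()
```

An autoregressive baseline of the kind described above would instead emit labels one at a time, each conditioned on the previously generated ones; the sketch's single parallel decoding pass is what makes the non-autoregressive variant's inference cost independent of the number of predicted labels.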