Distributed representations induced from large unlabeled text collections have had a large impact
on many natural language processing (NLP) applications, providing an effective and simple way
of dealing with data sparsity. Word embedding methods [1, 2, 3, 4] typically represent words as
vectors in a low-dimensional space. In contrast, we encode them as probability densities. Intuitively,
each density represents a distribution over the possible ‘meanings’ of the word. Representing a
word as a distribution has many attractive properties. For example, this lets us encode generality of
terms (e.g., ‘animal’ is a hypernym of ‘dog’), characterize uncertainty about their meaning (e.g., a
proper noun, such as ’John’, encodes little about the person it refers to) or represent polysemy (e.g.,
’tip’ may refer to a gratuity or a sharp edge of an object). Capturing entailment (e.g., ‘run’ entails
’move’) is especially important as it needs to be explicitly or implicitly accounted for in many NLP
applications (e.g., question answering or summarization). Intuitively, distributions provide a natural
way of encoding entailment: the entailment decision can be made by testing the level sets of the
distributions for ‘soft inclusion’ (e.g., using the KL divergence [5]).
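As a concrete illustration of this intuition (a minimal sketch, not the model developed in this paper), suppose each word is represented by a diagonal-covariance Gaussian density; the toy parameters for ‘dog’ and ‘animal’ below are hand-picked for the example. The asymmetry of the KL divergence then acts as a directional entailment score: the narrow ‘dog’ density sits inside the broad ‘animal’ density, so KL(dog ‖ animal) is small while KL(animal ‖ dog) is large.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) )."""
    return 0.5 * np.sum(
        var0 / var1                      # trace term
        + (mu1 - mu0) ** 2 / var1        # mean-difference term
        - 1.0
        + np.log(var1) - np.log(var0)    # log-determinant term
    )

# Hypothetical 2-d densities: 'animal' is broad (high variance),
# 'dog' is a narrower density lying inside it.
mu_animal, var_animal = np.zeros(2), np.full(2, 4.0)
mu_dog, var_dog = np.array([0.5, -0.3]), np.ones(2)

# Asymmetric scores: small KL(dog || animal) and large KL(animal || dog)
# suggest that 'dog' entails 'animal' but not vice versa.
print(gaussian_kl(mu_dog, var_dog, mu_animal, var_animal))    # ~0.68
print(gaussian_kl(mu_animal, var_animal, mu_dog, var_dog))    # ~1.78
```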