Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space
First Claim
1. An apparatus comprising:
an electronic data processing device configured to:
perform a modeling method including:
constructing a set of word embedding transforms by operations including generating a term-document matrix whose elements represent occurrence frequencies for text words in documents of a set of documents and include inverse document frequency (IDF) scaling;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising a mixture model including M mixture components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
perform a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned mixture model to the set or sequence of word vectors representing the input document in order to generate one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
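For illustration, the word embedding construction recited in claim 1 (an IDF-scaled term-document matrix reduced to K dimensions, each vocabulary word becoming a K-dimensional word vector) could be sketched as follows. This is a minimal sketch, not the claimed implementation; the use of TF-IDF weighting, truncated SVD as the dimensionality reduction algorithm, and all function and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def build_word_embeddings(train_docs, K=50):
    """Construct word embedding transforms from a training corpus.

    Builds an IDF-scaled term-document matrix and reduces it to K
    dimensions so each vocabulary word maps to a K-dimensional vector.
    """
    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(train_docs)   # (n_docs, n_terms), TF-IDF weighted
    term_doc = doc_term.T                             # term-document matrix (n_terms, n_docs)

    # Dimensionality reduction: each row becomes a K-dimensional word vector.
    svd = TruncatedSVD(n_components=K, random_state=0)
    word_vectors = svd.fit_transform(term_doc)        # (n_terms, K)

    vocab = vectorizer.get_feature_names_out()
    embeddings = {word: word_vectors[i] for i, word in enumerate(vocab)}
    return embeddings, vectorizer

def embed_document(text, embeddings, vectorizer):
    """Transform a document's text words into a sequence of K-dimensional word vectors."""
    analyzer = vectorizer.build_analyzer()
    vectors = [embeddings[w] for w in analyzer(text) if w in embeddings]
    return np.array(vectors)
```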
Abstract
A set of word embedding transforms is applied to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents. A probabilistic topic model is learned using the sets or sequences of word vectors representing the documents of the set of documents. The set of word embedding transforms is applied to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document. The learned probabilistic topic model is applied to assign probabilities for topics of the probabilistic topic model to the set or sequence of word vectors representing the input document. A document processing operation such as annotation, classification, or similar document retrieval may be performed using the assigned topic probabilities.
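As a concrete illustration of the processing described in the abstract, the sketch below assumes a fitted scikit-learn GaussianMixture plays the role of the mixture-model topic model and that word vectors come from a helper such as the hypothetical embed_document above. It averages per-word topic posteriors into a topic probability histogram and uses cosine similarity of histograms for similar-document retrieval; all names are illustrative, not the patent's terminology.

```python
import numpy as np

def topic_histogram(word_vectors, topic_model):
    """Average per-word topic posteriors into an M-dimensional probability histogram."""
    if len(word_vectors) == 0:
        # No in-vocabulary words: fall back to a uniform topic distribution.
        return np.full(topic_model.n_components, 1.0 / topic_model.n_components)
    responsibilities = topic_model.predict_proba(word_vectors)  # (n_words, M)
    return responsibilities.mean(axis=0)                        # (M,)

def most_similar(query_hist, corpus_hists, top_n=5):
    """Rank corpus documents by cosine similarity of their topic histograms."""
    corpus = np.asarray(corpus_hists)
    sims = corpus @ query_hist / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query_hist) + 1e-12)
    return np.argsort(-sims)[:top_n]
```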
24 Claims
1. An apparatus comprising:
an electronic data processing device configured to:
perform a modeling method including:
constructing a set of word embedding transforms by operations including generating a term-document matrix whose elements represent occurrence frequencies for text words in documents of a set of documents and include inverse document frequency (IDF) scaling;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising a mixture model including M mixture components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
perform a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned mixture model to the set or sequence of word vectors representing the input document in order to generate one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
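Option (2) of claim 1 generates Fisher vectors from the mixture model. A common way to compute a Fisher vector for a set of embedded word vectors under a diagonal-covariance Gaussian mixture is sketched below; the gradient formulation, normalization choices, and function names are assumptions for illustration, not necessarily what the patent specifies.

```python
import numpy as np

def fisher_vector(word_vectors, gmm):
    """Compute a Fisher vector for one document from a diagonal-covariance GMM.

    word_vectors: (N, K) array of embedded word vectors for the document.
    gmm: fitted sklearn GaussianMixture with covariance_type='diag'.
    Returns a 2*M*K vector of normalized gradients with respect to the
    mixture means and standard deviations.
    """
    X = np.atleast_2d(word_vectors)                    # (N, K)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, M) per-word topic posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                               # (M, K)

    # Standardized differences between each word vector and each component mean.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, M, K)

    # Gradients with respect to means and standard deviations.
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (
        N * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power and L2 normalization, as is common for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```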
12. A non-transitory storage medium storing instructions executable by an electronic data processing device to perform operations including:
constructing a set of word embedding transforms by applying a dimensionality reduction algorithm to a term-document matrix constructed from a set of training documents and having matrix elements storing word frequencies scaled by a metric indicative of the frequencies of occurrence of the words in the set of documents to generate K-dimensional word vectors corresponding to the text words of the term-document matrix;
performing a modeling method including:
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate the sets or sequences of word vectors representing the documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising M components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
performing a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into corresponding K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned probabilistic topic model to the set or sequence of word vectors representing the input document in order to generate a topic model representation of the input document comprising one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
View Dependent Claims (13, 14, 15, 16, 17, 18, 19)
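The learning step of claim 12 fits an M-component mixture over the embedded word vectors pooled from the training documents, with each component playing the role of one topic. A minimal sketch, assuming the mixture model is a Gaussian mixture fit with scikit-learn (function and parameter names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_topic_model(embedded_docs, M=20):
    """Fit an M-component Gaussian mixture over all embedded word vectors.

    embedded_docs: list of (n_words_i, K) arrays, one per training document.
    The fitted model can later assign per-word topic posteriors to any set
    or sequence of K-dimensional word vectors.
    """
    all_vectors = np.vstack([d for d in embedded_docs if len(d) > 0])
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          random_state=0)
    gmm.fit(all_vectors)
    return gmm
```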
20. A method comprising:
(i) generating a probabilistic topic model comprising M topics by operations including:
constructing a set of word embedding transforms used in the embedding operation by applying a dimensionality reduction algorithm to a term frequency-inverse document frequency matrix constructed from a set of training documents to generate K-dimensional word vectors corresponding to the text words of the term-document matrix;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning the probabilistic topic model using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
(ii) processing an input document by operations including:
applying the set of word embedding transforms to transform text words of the input document into corresponding K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned probabilistic topic model to the set or sequence of word vectors representing the input document to compute a set of statistics representing the input document based on the learned topic model;
wherein the operations (i) and (ii) are performed by an electronic data processing device.
View Dependent Claims (21, 22, 23, 24)
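Claim 20 ties the modeling operations (i) and the document-processing operations (ii) together. A toy end-to-end usage sketch, reusing the hypothetical helpers from the earlier examples (build_word_embeddings, embed_document, learn_topic_model, topic_histogram, fisher_vector) on made-up data; none of the names or data come from the patent itself:

```python
import numpy as np

# Illustrative training corpus (made up for the example).
train_docs = ["the cat sat on the mat", "dogs and cats are pets",
              "stock markets fell sharply", "investors sold their shares"]

# (i) Generate the probabilistic topic model.
embeddings, vectorizer = build_word_embeddings(train_docs, K=2)
embedded_train = [embed_document(d, embeddings, vectorizer) for d in train_docs]
topic_model = learn_topic_model(embedded_train, M=2)

# (ii) Process an input document into statistics based on the learned model.
query_vectors = embed_document("cats chase dogs", embeddings, vectorizer)
print("topic probabilities:", topic_histogram(query_vectors, topic_model))
print("Fisher vector length:", fisher_vector(query_vectors, topic_model).shape[0])
```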
Specification