Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space
First Claim
1. An apparatus comprising:
an electronic data processing device configured to:
perform a modeling method including:
constructing a set of word embedding transforms by operations including generating a term-document matrix whose elements represent occurrence frequencies for text words in documents of a set of documents and include inverse document frequency (IDF) scaling;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising a mixture model including M mixture components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
perform a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned mixture model to the set or sequence of word vectors representing the input document in order to generate one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
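For illustration, the word embedding construction recited in claim 1 (an IDF-scaled term-document matrix reduced to K dimensions, each vocabulary word becoming a K-dimensional word vector) could be sketched as follows. This is a minimal sketch, not the claimed implementation; the use of TF-IDF weighting, truncated SVD as the dimensionality reduction algorithm, and all function and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def build_word_embeddings(train_docs, K=50):
    """Construct word embedding transforms from a training corpus.

    Builds an IDF-scaled term-document matrix and reduces it to K
    dimensions so each vocabulary word maps to a K-dimensional vector.
    """
    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(train_docs)   # (n_docs, n_terms), TF-IDF weighted
    term_doc = doc_term.T                             # term-document matrix (n_terms, n_docs)

    # Dimensionality reduction: each row becomes a K-dimensional word vector.
    svd = TruncatedSVD(n_components=K, random_state=0)
    word_vectors = svd.fit_transform(term_doc)        # (n_terms, K)

    vocab = vectorizer.get_feature_names_out()
    embeddings = {word: word_vectors[i] for i, word in enumerate(vocab)}
    return embeddings, vectorizer

def embed_document(text, embeddings, vectorizer):
    """Transform a document's text words into a sequence of K-dimensional word vectors."""
    analyzer = vectorizer.build_analyzer()
    vectors = [embeddings[w] for w in analyzer(text) if w in embeddings]
    return np.array(vectors)
```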
Abstract
A set of word embedding transforms is applied to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents. A probabilistic topic model is learned using the sets or sequences of word vectors representing the documents of the set of documents. The set of word embedding transforms is applied to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document. The learned probabilistic topic model is applied to assign probabilities for topics of the probabilistic topic model to the set or sequence of word vectors representing the input document. A document processing operation such as annotation, classification, or similar document retrieval may be performed using the assigned topic probabilities.
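As a concrete illustration of the processing described in the abstract, the sketch below assumes a fitted scikit-learn GaussianMixture plays the role of the mixture-model topic model and that word vectors come from a helper such as the hypothetical embed_document above. It averages per-word topic posteriors into a topic probability histogram and uses cosine similarity of histograms for similar-document retrieval; all names are illustrative, not the patent's terminology.

```python
import numpy as np

def topic_histogram(word_vectors, topic_model):
    """Average per-word topic posteriors into an M-dimensional probability histogram."""
    if len(word_vectors) == 0:
        # No in-vocabulary words: fall back to a uniform topic distribution.
        return np.full(topic_model.n_components, 1.0 / topic_model.n_components)
    responsibilities = topic_model.predict_proba(word_vectors)  # (n_words, M)
    return responsibilities.mean(axis=0)                        # (M,)

def most_similar(query_hist, corpus_hists, top_n=5):
    """Rank corpus documents by cosine similarity of their topic histograms."""
    corpus = np.asarray(corpus_hists)
    sims = corpus @ query_hist / (
        np.linalg.norm(corpus, axis=1) * np.linalg.norm(query_hist) + 1e-12)
    return np.argsort(-sims)[:top_n]
```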
24 Claims
1. An apparatus comprising:
an electronic data processing device configured to:
perform a modeling method including:
constructing a set of word embedding transforms by operations including generating a term-document matrix whose elements represent occurrence frequencies for text words in documents of a set of documents and include inverse document frequency (IDF) scaling;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising a mixture model including M mixture components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
perform a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned mixture model to the set or sequence of word vectors representing the input document in order to generate one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
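Option (2) of claim 1 generates Fisher vectors from the mixture model. A common way to compute a Fisher vector for a set of embedded word vectors under a diagonal-covariance Gaussian mixture is sketched below; the gradient formulation, normalization choices, and function names are assumptions for illustration, not necessarily what the patent specifies.

```python
import numpy as np

def fisher_vector(word_vectors, gmm):
    """Compute a Fisher vector for one document from a diagonal-covariance GMM.

    word_vectors: (N, K) array of embedded word vectors for the document.
    gmm: fitted sklearn GaussianMixture with covariance_type='diag'.
    Returns a 2*M*K vector of normalized gradients with respect to the
    mixture means and standard deviations.
    """
    X = np.atleast_2d(word_vectors)                    # (N, K)
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                       # (N, M) per-word topic posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                               # (M, K)

    # Standardized differences between each word vector and each component mean.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # (N, M, K)

    # Gradients with respect to means and standard deviations.
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (
        N * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # Power and L2 normalization, as is common for Fisher vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```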
12. A non-transitory storage medium storing instructions executable by an electronic data processing device to perform operations including:
constructing a set of word embedding transforms by applying a dimensionality reduction algorithm to a term-document matrix constructed from a set of training documents and having matrix elements storing word frequencies scaled by a metric indicative of the frequencies of occurrence of the words in the set of documents to generate K-dimensional word vectors corresponding to the text words of the term-document matrix;
performing a modeling method including:
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate the sets or sequences of word vectors representing the documents where K is an integer greater than or equal to two; and
learning a probabilistic topic model comprising M components representing M topics using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
performing a document processing method including:
applying the set of word embedding transforms to transform text words of an input document into corresponding K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned probabilistic topic model to the set or sequence of word vectors representing the input document in order to generate a topic model representation of the input document comprising one of (1) a vector or histogram of topic probabilities representing the input document or (2) one or more Fisher vectors representing the input document.
View Dependent Claims (13, 14, 15, 16, 17, 18, 19)
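The learning step of claim 12 fits an M-component mixture over the embedded word vectors pooled from the training documents, with each component playing the role of one topic. A minimal sketch, assuming the mixture model is a Gaussian mixture fit with scikit-learn (function and parameter names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_topic_model(embedded_docs, M=20):
    """Fit an M-component Gaussian mixture over all embedded word vectors.

    embedded_docs: list of (n_words_i, K) arrays, one per training document.
    The fitted model can later assign per-word topic posteriors to any set
    or sequence of K-dimensional word vectors.
    """
    all_vectors = np.vstack([d for d in embedded_docs if len(d) > 0])
    gmm = GaussianMixture(n_components=M, covariance_type='diag',
                          random_state=0)
    gmm.fit(all_vectors)
    return gmm
```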
20. A method comprising:
(i) generating a probabilistic topic model comprising M topics by operations including:
constructing a set of word embedding transforms used in the embedding operation by applying a dimensionality reduction algorithm to a term frequency-inverse document frequency matrix constructed from a set of training documents to generate K-dimensional word vectors corresponding to the text words of the term-document matrix;
applying the set of word embedding transforms to transform text words of a set of documents into K-dimensional word vectors in order to generate sets or sequences of word vectors representing the documents of the set of documents where K is an integer greater than or equal to two; and
learning the probabilistic topic model using the sets or sequences of word vectors representing the documents of the set of documents wherein the learned probabilistic topic model operates to assign probabilities for the topics of the probabilistic topic model to an input set or sequence of K-dimensional embedded word vectors; and
(ii) processing an input document by operations including:
applying the set of word embedding transforms to transform text words of the input document into corresponding K-dimensional word vectors in order to generate a set or sequence of word vectors representing the input document; and
applying the learned probabilistic topic model to the set or sequence of word vectors representing the input document to compute a set of statistics representing the input document based on the learned topic model;
wherein the operations (i) and (ii) are performed by an electronic data processing device.
View Dependent Claims (21, 22, 23, 24)
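Claim 20 ties the modeling operations (i) and the document-processing operations (ii) together. A toy end-to-end usage sketch, reusing the hypothetical helpers from the earlier examples (build_word_embeddings, embed_document, learn_topic_model, topic_histogram, fisher_vector) on made-up data; none of the names or data come from the patent itself:

```python
import numpy as np

# Illustrative training corpus (made up for the example).
train_docs = ["the cat sat on the mat", "dogs and cats are pets",
              "stock markets fell sharply", "investors sold their shares"]

# (i) Generate the probabilistic topic model.
embeddings, vectorizer = build_word_embeddings(train_docs, K=2)
embedded_train = [embed_document(d, embeddings, vectorizer) for d in train_docs]
topic_model = learn_topic_model(embedded_train, M=2)

# (ii) Process an input document into statistics based on the learned model.
query_vectors = embed_document("cats chase dogs", embeddings, vectorizer)
print("topic probabilities:", topic_histogram(query_vectors, topic_model))
print("Fisher vector length:", fisher_vector(query_vectors, topic_model).shape[0])
```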
Specification