Trans-lingual representation of text documents

US 8,738,354 B2
Filed: 06/19/2009
Issued: 05/27/2014
Est. Priority Date: 06/19/2009
Status: Active Grant

Abstract

A method of creating translingual text representations takes in documents in a first language and in a second language and, for each language, creates a document-term matrix from the words in the documents to represent which words are present in which documents. An algorithm is applied to each matrix such that like documents are placed close to each other and unlike documents are moved far from each other.
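
As a minimal sketch of the matrix-building step, the following constructs one document-term matrix per language, in both the count form and the binary-indicator form that the claims describe. The whitespace tokenizer, the toy corpus, and the helper names `build_vocabulary` and `document_term_matrix` are illustrative assumptions, not taken from the patent.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted distinct whitespace-separated terms (assumed tokenizer)."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def document_term_matrix(documents, vocabulary, binary=False):
    """Build a matrix with one row per document and one column per term."""
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        row = [counts[term] for term in vocabulary]
        if binary:
            # Binary indicator: 1 if the term appears at all, else 0.
            row = [1 if c > 0 else 0 for c in row]
        matrix.append(row)
    return matrix


# Hypothetical toy corpus: document i in each list is on the same topic.
english_docs = ["the cat chased the mouse", "rain fell on the city"]
spanish_docs = ["el gato persigue al raton", "la lluvia cae sobre la ciudad"]

english_vocab = build_vocabulary(english_docs)
spanish_vocab = build_vocabulary(spanish_docs)

english_counts = document_term_matrix(english_docs, english_vocab)
spanish_binary = document_term_matrix(spanish_docs, spanish_vocab, binary=True)
```

Each language keeps its own vocabulary, so the two matrices generally have different numbers of columns but the same number of rows when the documents are paired by topic.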

21 Claims

1. A method comprising:
- accepting first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
  
  accepting second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
  
  obtaining a first document-term matrix from the first language data, wherein the first document-term matrix comprises a plurality of first rows and different first rows of the first document-term matrix correspond to different first documents in the first language;
  
  obtaining a second document-term matrix from the second language data, wherein the second document-term matrix comprises a plurality of second rows and different second rows of the second document-term matrix correspond to different second documents in the second language; and
  
  applying an algorithm to the first document-term matrix to produce a first stored matrix for the first language and to the second document-term matrix to produce a second stored matrix for the second language, wherein:
  
  multiplying the first stored matrix by the first document-term matrix produces a plurality of first translingual text representation vectors,
  
  multiplying the second stored matrix by the second document-term matrix produces a plurality of second translingual text representation vectors, and
  
  applying the algorithm comprises adjusting the first stored matrix and the second stored matrix to thereby reduce distances between individual first translingual text representation vectors and individual second translingual text representation vectors for the pairs of documents that are on the same topic,
  
  wherein at least the applying the algorithm is performed by a computer.
- - 2. The method of claim 1, wherein the algorithm is an oriented principal component analysis algorithm.
  - 3. The method of claim 2, further comprising:
    - splitting the first document-term matrix into N non-overlapping first document-term submatrices for the first language;
      
      splitting the second document-term matrix into N non-overlapping second document-term submatrices for the second language;
      
      for each value i of N:
      
      applying the algorithm to the ith first document-term submatrix for the first language and the ith second document-term submatrix for the second language to create an ith set of stored matrices;
      
      storing the ith set of stored matrices in an ith instance of the algorithm;
      
      applying the ith set of stored matrices to the ith first document-term submatrix of the first language to create ith first translingual text representation (TTR) vectors in the first language;
      
      accumulating each of the ith first TTR vectors as rows in an ith first TTR submatrix for the first language; and
      
      applying the ith set of stored matrices to the ith second document-term submatrix of the second language, to produce an ith second TTR submatrix for the second language; and
      
      by appending columns together:
      
      combining the first TTR submatrices for the first language into a first TTR matrix for the first language, and
      
      combining the second TTR submatrices for the second language into a second TTR matrix for the second language.
  - 4. The method of claim 1, wherein the algorithm comprises a network training algorithm.
  - 5. The method of claim 1, wherein first entries in the first document-term matrix identify a number of times that first language terms appear in the first language data and wherein second entries in the second document-term matrix identify a number of times that second language terms appear in the second language data.
  - 6. The method of claim 1, wherein applying the algorithm further comprises adjusting the first stored matrix and the second stored matrix such that other distances are increased between other first translingual text representation vectors and other second translingual text representation vectors for other pairs of documents that are not on the same topic.
  - 7. The method of claim 6, wherein applying the algorithm minimizes the distances and maximizes the other distances.
  - 8. The method of claim 1, wherein first entries in the first document-term matrix are binary indicators of whether first language terms appear in the first language data and wherein second entries in the second document-term matrix are other binary indicators of whether second language terms appear in the second language data.
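
The block-wise procedure of claim 3 can be sketched as follows for one language. This is only an illustration under stated assumptions: the split is taken along the term (column) dimension, which is one reading consistent with recombining the results "by appending columns together", and the per-block stored matrices are hard-coded hypothetical values standing in for whatever the algorithm (for example oriented principal component analysis) would actually produce. The function names are hypothetical.

```python
def split_columns(matrix, n_blocks):
    """Split a document-term matrix into n_blocks non-overlapping column blocks."""
    n_terms = len(matrix[0])
    bounds = [i * n_terms // n_blocks for i in range(n_blocks + 1)]
    return [[row[bounds[i]:bounds[i + 1]] for row in matrix]
            for i in range(n_blocks)]


def project(stored, block):
    """Multiply every document row of `block` by a stored matrix (k x terms)."""
    return [[sum(w * x for w, x in zip(stored_row, doc)) for stored_row in stored]
            for doc in block]


def append_columns(submatrices):
    """Concatenate per-block TTR submatrices row by row (column-wise append)."""
    n_docs = len(submatrices[0])
    return [sum((sub[d] for sub in submatrices), []) for d in range(n_docs)]


# Hypothetical 2-document, 4-term matrix split into N = 2 blocks.
matrix = [[1, 2, 3, 4],
          [0, 1, 0, 2]]
blocks = split_columns(matrix, 2)

# Placeholder stored matrices, one per block (each 1 x 2). In the claimed
# method these would come from the i-th instance of the algorithm.
stored_per_block = [[[1, 1]], [[1, -1]]]

ttr_submatrices = [project(s, b) for s, b in zip(stored_per_block, blocks)]
ttr = append_columns(ttr_submatrices)
```

The same steps would be repeated for the second language with its own blocks and its own stored matrices from the same ith set, so that row d of both TTR matrices describes the dth paired document.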

9. A computer memory device or storage device comprising computer executable instructions which, when executed by a processing unit of a computing device, cause the processing unit to perform acts comprising:
- accepting first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
  
  accepting second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
  
  obtaining a first document-term representation from the first language data, wherein the first document-term representation comprises a plurality of first components corresponding to different first documents in the first language;
  
  obtaining a second document-term representation from the second language data, wherein the second document-term representation comprises a plurality of second components corresponding to different second documents in the second language; and
  
  applying an algorithm to the first document-term representation and the second document-term representation to produce first translingual text representations and second translingual text representations, wherein the algorithm comprises:
  
  multiplying a first stored matrix by the first document-term representation to produce the first translingual text representations;
  
  multiplying a second stored matrix by the second document-term representation to produce the second translingual text representations; and
  
  reducing distances between individual first translingual text representations and individual second translingual text representations for the pairs of documents that are on the same topic by altering the first stored matrix and the second stored matrix.
- - 10. The computer memory device or storage device of claim 9, wherein the algorithm is an oriented principal component analysis algorithm.
  - 11. The computer memory device or storage device of claim 9, wherein the algorithm comprises a network training algorithm.
  - 12. The computer memory device or storage device of claim 9, wherein first entries in the first components of the first document-term representation identify a number of times that first language terms appear in the first language data and wherein second entries in the second components of the second document-term representation identify a number of times that second language terms appear in the second language data.
  - 13. The computer memory device or storage device of claim 9, wherein applying the algorithm further comprises increasing other distances between other first translingual text representations and other second translingual text representations for other pairs of documents that are not on the same topic by altering the first stored matrix and the second stored matrix.
  - 14. The computer memory device or storage device of claim 13, wherein applying the algorithm minimizes the distances and maximizes the other distances.
  - 15. The computer memory device or storage device of claim 9, wherein the first representation comprises a first document-term matrix and the second representation comprises a second document-term matrix.
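
To illustrate how two stored matrices might be altered so that same-topic pairs move together (claim 9) while other pairs move apart (claim 13), here is a small gradient-descent sketch. The claims do not fix an objective, so the squared-distance loss with a margin, the learning rate, the epoch count, the random initialization, and all function names are assumptions made for this example only.

```python
import math
import random


def project(stored, doc):
    """Multiply a stored matrix (k x terms) by one document-term row."""
    return [sum(w * x for w, x in zip(row, doc)) for row in stored]


def distance(u, v):
    """Euclidean distance between two representation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def add_outer(stored, scale, diff, doc):
    """In place: stored += scale * outer(diff, doc)."""
    for r, d in enumerate(diff):
        for c, x in enumerate(doc):
            stored[r][c] += scale * d * x


def train(first_docs, second_docs, k=2, margin=1.0, lr=0.05, epochs=200, seed=0):
    """Adjust two stored matrices; document i in each list shares a topic."""
    rng = random.Random(seed)
    first = [[rng.uniform(-0.1, 0.1) for _ in first_docs[0]] for _ in range(k)]
    second = [[rng.uniform(-0.1, 0.1) for _ in second_docs[0]] for _ in range(k)]
    for _ in range(epochs):
        for i, x in enumerate(first_docs):
            for j, y in enumerate(second_docs):
                a, b = project(first, x), project(second, y)
                diff = [p - q for p, q in zip(a, b)]
                if i == j:
                    sign = -1.0   # same topic: descend on squared distance
                elif sum(d * d for d in diff) < margin:
                    sign = 1.0    # different topics, too close: push apart
                else:
                    continue      # different topics, already far enough
                add_outer(first, sign * 2 * lr, diff, x)
                add_outer(second, -sign * 2 * lr, diff, y)
    return first, second


# Hypothetical document-term rows; X[i] and Y[i] are on the same topic.
X = [[1, 0, 1], [0, 1, 0]]
Y = [[1, 1, 0, 0], [0, 0, 1, 1]]
first, second = train(X, Y)

paired = [distance(project(first, X[i]), project(second, Y[i])) for i in range(2)]
unpaired = [distance(project(first, X[0]), project(second, Y[1])),
            distance(project(first, X[1]), project(second, Y[0]))]
```

Without the push-apart term, setting both stored matrices to zero would trivially minimize the paired distances, which is one reason a sketch like this includes the margin.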

16. A computer system comprising:
- a memory comprising computer executable instructions; and
  
  a processing unit configured to execute the computer executable instructions, wherein the computer executable instructions configure the processing unit to:
  
  accept first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
  
  accept second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
  
  obtain a plurality of first rows from the first language data, wherein different first rows correspond to different first documents in the first language;
  
  obtain a plurality of second rows from the second language data, wherein different second rows correspond to different second documents in the second language; and
  
  apply an algorithm to the plurality of first rows and the plurality of second rows to produce first translingual text representations and second translingual text representations, wherein the algorithm comprises:
  
  training a first neural network on the plurality of first rows and outputting, from the first neural network, the first translingual text representations,
  
  training a second neural network on the plurality of second rows and outputting, from the second neural network, the second translingual text representations, and
  
  adjusting parameters of the first neural network and the second neural network such that distances are reduced between individual first translingual text representations and individual second translingual text representations for the pairs of documents that are on the same topic.
- - 17. The computer system of claim 16, wherein the plurality of first rows are part of a first document-term matrix and the plurality of second rows are part of a second document-term matrix.
  - 18. The computer system of claim 16, wherein the algorithm comprises a Siamese network training algorithm.
  - 19. The computer system of claim 16, wherein first entries in the plurality of first rows identify a number of times that first language terms appear in the first language data and wherein second entries in the plurality of second rows identify a number of times that second language terms appear in the second language data.
  - 20. The computer system of claim 16, wherein applying the algorithm further comprises adjusting the parameters to increase other distances between other first translingual text representations and other second translingual text representations for other pairs of documents that are not on the same topic.
  - 21. The computer system of claim 20, wherein the algorithm is configured to minimize the distances and maximize the other distances.
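
The two-network arrangement of claims 16 to 21 can be sketched with one small network per language trained on a contrastive margin loss of the kind commonly used for Siamese networks. Everything specific here is an assumption for illustration: the single tanh layer (`TinyNet`), the loss, the hyperparameters, and the toy rows are not taken from the patent, and a practical system would likely use deeper networks and far more data.

```python
import math
import random


class TinyNet:
    """One tanh layer, output = tanh(W @ row); a hypothetical stand-in network."""

    def __init__(self, n_inputs, n_outputs, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                  for _ in range(n_outputs)]

    def forward(self, row):
        return [math.tanh(sum(w * x for w, x in zip(ws, row))) for ws in self.w]

    def step(self, row, output, grad_output, lr):
        """Gradient step given dLoss/dOutput, back-propagated through tanh."""
        for r, (o, g) in enumerate(zip(output, grad_output)):
            delta = g * (1.0 - o * o)  # derivative of tanh is 1 - tanh^2
            for c, x in enumerate(row):
                self.w[r][c] -= lr * delta * x


def train_pair(first_rows, second_rows, k=2, margin=1.0, lr=0.05, epochs=500, seed=0):
    """Train one network per language; row i in each list shares a topic."""
    rng = random.Random(seed)
    net1 = TinyNet(len(first_rows[0]), k, rng)
    net2 = TinyNet(len(second_rows[0]), k, rng)
    for _ in range(epochs):
        for i, x in enumerate(first_rows):
            for j, y in enumerate(second_rows):
                a, b = net1.forward(x), net2.forward(y)
                diff = [p - q for p, q in zip(a, b)]
                sq = sum(d * d for d in diff)
                if i == j:
                    grad = [2 * d for d in diff]    # loss = squared distance
                elif sq < margin:
                    grad = [-2 * d for d in diff]   # loss = margin - squared distance
                else:
                    continue                        # far enough apart already
                net1.step(x, a, grad, lr)
                net2.step(y, b, [-g for g in grad], lr)
    return net1, net2


def distance(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


# Hypothetical document-term rows; rows_a[i] and rows_b[i] are on the same topic.
rows_a = [[1, 0, 1], [0, 1, 0]]
rows_b = [[1, 1, 0, 0], [0, 0, 1, 1]]
net1, net2 = train_pair(rows_a, rows_b)

paired = [distance(net1.forward(rows_a[i]), net2.forward(rows_b[i])) for i in range(2)]
unpaired = [distance(net1.forward(rows_a[0]), net2.forward(rows_b[1])),
            distance(net1.forward(rows_a[1]), net2.forward(rows_b[0]))]
```

The two networks keep separate weights because the two languages have different vocabularies, and therefore different input widths; only the loss ties them together.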

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Platt, John; Sutskever, Ilya
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Adesanya, Olujimi A
Application Number: US12/488,422
Publication Number: US 20100324883A1
Time in Patent Office: 1,803 Days
Field of Search: 704/2, 704/8, 704/9
US Class Current: 704/2
CPC Class Codes: G06F 40/40 Processing or translation o...; G06N 3/045 Combinations of networks
