Trans-lingual representation of text documents
Abstract
A method of creating translingual text representations takes in documents in a first language and in a second language and creates a matrix using the words in the documents to represent which words are present in which language. An algorithm is applied to each matrix such that like documents are placed close to each other and unlike documents are moved far from each other.
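As a rough illustration of the matrices the abstract and claims refer to, the sketch below builds one document-term matrix per language, with one row per document and one column per vocabulary term. The function name `document_term_matrix`, the whitespace tokenizer, the raw-count weighting, and the toy corpora are all assumptions made for the example; the source does not specify them.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Assumption: lowercase whitespace tokenization; real systems would do more.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary term.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpora: document i in each list is on the same topic.
english_docs = ["the cat sleeps", "stock markets fall"]
spanish_docs = ["el gato duerme", "los mercados caen"]

en_vocab, en_matrix = document_term_matrix(english_docs)
es_vocab, es_matrix = document_term_matrix(spanish_docs)

print(en_vocab)
print(en_matrix)
```

Each language gets its own vocabulary and therefore its own column space, which is why a separate mapping per language is needed before documents can be compared across languages.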
21 Claims
1. A method comprising:
- accepting first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
- accepting second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
- obtaining a first document-term matrix from the first language data, wherein the first document-term matrix comprises a plurality of first rows and different first rows of the first document-term matrix correspond to different first documents in the first language;
- obtaining a second document-term matrix from the second language data, wherein the second document-term matrix comprises a plurality of second rows and different second rows of the second document-term matrix correspond to different second documents in the second language; and
- applying an algorithm to the first document-term matrix to produce a first stored matrix for the first language and to the second document-term matrix to produce a second stored matrix for the second language, wherein:
  - multiplying the first stored matrix by the first document-term matrix produces a plurality of first translingual text representation vectors,
  - multiplying the second stored matrix by the second document-term matrix produces a plurality of second translingual text representation vectors, and
  - applying the algorithm comprises adjusting the first stored matrix and the second stored matrix to thereby reduce distances between individual first translingual text representation vectors and individual second translingual text representation vectors for the pairs of documents that are on the same topic,
- wherein at least the applying the algorithm is performed by a computer.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
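The matrix steps recited in claim 1 can be sketched as follows: each language has a stored matrix, multiplying it by a document-term row gives a translingual vector, and the stored matrices are adjusted so that vectors of paired documents move closer. This is only a minimal sketch under stated assumptions. Plain gradient descent on the squared distance is a stand-in chosen for illustration, not the algorithm the patent necessarily uses, and the names `project`, `paired_distance`, and `pull_step`, the learning rate, the two output dimensions, and the toy rows are all hypothetical.

```python
import random


def project(stored, doc_row):
    """Multiply a stored matrix (dims x vocabulary) by one document-term row."""
    return [sum(w * x for w, x in zip(weights, doc_row)) for weights in stored]


def paired_distance(stored_a, stored_b, rows_a, rows_b):
    """Sum of squared distances between projections of paired documents."""
    total = 0.0
    for row_a, row_b in zip(rows_a, rows_b):
        vec_a, vec_b = project(stored_a, row_a), project(stored_b, row_b)
        total += sum((p - q) ** 2 for p, q in zip(vec_a, vec_b))
    return total


def pull_step(stored_a, stored_b, rows_a, rows_b, learning_rate=0.05):
    """One gradient-descent pass that pulls paired projections together."""
    for row_a, row_b in zip(rows_a, rows_b):
        vec_a, vec_b = project(stored_a, row_a), project(stored_b, row_b)
        for k, (p, q) in enumerate(zip(vec_a, vec_b)):
            diff = p - q
            # Gradient of diff**2 is +2*diff*x for stored_a and -2*diff*y for stored_b.
            for j, x in enumerate(row_a):
                stored_a[k][j] -= learning_rate * 2 * diff * x
            for j, y in enumerate(row_b):
                stored_b[k][j] += learning_rate * 2 * diff * y


# Hypothetical document-term rows; row i in each language is on the same topic.
rows_en = [[1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 0]]
rows_es = [[0, 1, 1, 1, 0, 0], [1, 0, 0, 0, 1, 1]]

rng = random.Random(0)
dims = 2  # assumed size of the translingual vectors
stored_en = [[rng.uniform(-1, 1) for _ in rows_en[0]] for _ in range(dims)]
stored_es = [[rng.uniform(-1, 1) for _ in rows_es[0]] for _ in range(dims)]

before = paired_distance(stored_en, stored_es, rows_en, rows_es)
for _ in range(50):
    pull_step(stored_en, stored_es, rows_en, rows_es)
after = paired_distance(stored_en, stored_es, rows_en, rows_es)
print(before, after)
```

A pull-only objective like this one could also be satisfied by shrinking every vector toward zero, so a practical method would need a constraint or an extra term that keeps documents on different topics apart, as the abstract suggests.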
9. A computer memory device or storage device comprising computer executable instructions which, when executed by a processing unit of a computing device, cause the processing unit to perform acts comprising:
- accepting first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
- accepting second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
- obtaining a first document-term representation from the first language data, wherein the first document-term representation comprises a plurality of first components corresponding to different first documents in the first language;
- obtaining a second document-term representation from the second language data, wherein the second document-term representation comprises a plurality of second components corresponding to different second documents in the second language; and
- applying an algorithm to the first document-term representation and the second document-term representation to produce first translingual text representations and second translingual text representations, wherein the algorithm comprises:
  - multiplying a first stored matrix by the first document-term representation to produce the first translingual text representations;
  - multiplying a second stored matrix by the second document-term representation to produce the second translingual text representations; and
  - reducing distances between individual first translingual text representations and individual second translingual text representations for the pairs of documents that are on the same topic by altering the first stored matrix and the second stored matrix.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A computer system comprising:
- a memory comprising computer executable instructions; and
- a processing unit configured to execute the computer executable instructions, wherein the computer executable instructions configure the processing unit to:
  - accept first language data, wherein the first language data comprises first documents in a first language and the first documents are associated with multiple topics;
  - accept second language data, wherein the second language data comprises second documents in a second language that is different than the first language, wherein the second documents in the second language are also associated with at least some of the multiple topics and the first language data and second language data collectively comprise pairs of documents that are on the same topic;
  - obtain a plurality of first rows from the first language data, wherein different first rows correspond to different first documents in the first language;
  - obtain a plurality of second rows from the second language data, wherein different second rows correspond to different second documents in the second language; and
  - apply an algorithm to the plurality of first rows and the plurality of second rows to produce first translingual text representations and second translingual text representations, wherein the algorithm comprises:
    - training a first neural network on the plurality of first rows and outputting, from the first neural network, the first translingual text representations,
    - training a second neural network on the plurality of second rows and outputting, from the second neural network, the second translingual text representations, and
    - adjusting parameters of the first neural network and the second neural network such that distances are reduced between individual first translingual text representations and individual second translingual text representations for the pairs of documents that are on the same topic.
Dependent claims: 17, 18, 19, 20, 21
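The neural-network variant in claim 16 can be illustrated with two very small networks, one per language, whose parameters are adjusted so that outputs for paired documents move closer together. The sketch below is an assumption-heavy toy: each "network" is a single tanh layer, training is plain gradient descent on the squared output distance, and the names `forward`, `pair_loss`, and `train_networks_step`, along with the learning rate and input rows, are hypothetical. The claim does not state the architecture or the training procedure.

```python
import math
import random


def forward(weights, doc_row):
    """Single tanh layer: one output unit per row of `weights`."""
    return [math.tanh(sum(w * x for w, x in zip(unit, doc_row))) for unit in weights]


def pair_loss(net_a, net_b, rows_a, rows_b):
    """Sum of squared distances between the two networks' outputs for paired rows."""
    total = 0.0
    for row_a, row_b in zip(rows_a, rows_b):
        out_a, out_b = forward(net_a, row_a), forward(net_b, row_b)
        total += sum((p - q) ** 2 for p, q in zip(out_a, out_b))
    return total


def train_networks_step(net_a, net_b, rows_a, rows_b, learning_rate=0.05):
    """One gradient-descent pass over all pairs, updating both networks."""
    for row_a, row_b in zip(rows_a, rows_b):
        out_a, out_b = forward(net_a, row_a), forward(net_b, row_b)
        for k, (p, q) in enumerate(zip(out_a, out_b)):
            # Derivative of (p - q)**2 through tanh, whose slope is 1 - output**2.
            grad_a = 2 * (p - q) * (1 - p * p)
            grad_b = -2 * (p - q) * (1 - q * q)
            for j, x in enumerate(row_a):
                net_a[k][j] -= learning_rate * grad_a * x
            for j, y in enumerate(row_b):
                net_b[k][j] -= learning_rate * grad_b * y


# Hypothetical document-term rows; row i in each language is on the same topic.
rows_first = [[1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 0]]
rows_second = [[0, 1, 1, 1, 0, 0], [1, 0, 0, 0, 1, 1]]

rng = random.Random(0)
net_first = [[rng.uniform(-1, 1) for _ in rows_first[0]] for _ in range(2)]
net_second = [[rng.uniform(-1, 1) for _ in rows_second[0]] for _ in range(2)]

loss_before = pair_loss(net_first, net_second, rows_first, rows_second)
for _ in range(200):
    train_networks_step(net_first, net_second, rows_first, rows_second)
loss_after = pair_loss(net_first, net_second, rows_first, rows_second)
print(loss_before, loss_after)
```

The two networks share nothing but the loss, which is what lets each one accept its own language's vocabulary while still producing outputs in a common space.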
Specification