PARALLEL DOCUMENT MINING
First Claim
1. A computer-implemented method comprising:
extracting, using one or more processors, a plurality of matching features and a plurality of scoring features from a collection of documents in multiple languages;
generating a forward index based on the plurality of scoring features, the forward index comprising one or more scoring feature lists containing at least one scoring feature extracted from the documents in the collection;
generating an inverted index based on the plurality of matching features, the inverted index comprising one or more matching document lists, where each matching document list identifies a group of matching documents from the collection that share a corresponding matching feature;
generating, for each matching document list in the inverted index, a corresponding plurality of matching document pairs;
calculating, for each matching document pair, a score based on information from the forward index; and
determining, based on the score of each matching document pair, whether each matching document pair contains a first matching document and a second matching document that is a translation of the first matching document.
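The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; several choices here are assumptions made only for illustration: documents are taken to be already tokenized into a shared representation (for example, after machine translation into one language), word bigrams stand in for matching features, single words stand in for scoring features, the score is a Jaccard overlap, and the 0.5 threshold and the same-language filter are hypothetical. All function names are invented for this example.

```python
from collections import defaultdict
from itertools import combinations

def matching_features(tokens):
    # Assumption: word bigrams serve as matching features.
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def scoring_features(tokens):
    # Assumption: individual words serve as scoring features.
    return set(tokens)

def build_indexes(docs):
    forward = {}                  # doc_id -> scoring feature list (a set here)
    inverted = defaultdict(set)   # matching feature -> matching document list
    for doc_id, (_lang, tokens) in docs.items():
        forward[doc_id] = scoring_features(tokens)
        for feature in matching_features(tokens):
            inverted[feature].add(doc_id)
    return forward, inverted

def matching_pairs(inverted, docs):
    # Generate pairs from every matching document list in the inverted index.
    pairs = set()
    for doc_ids in inverted.values():
        for a, b in combinations(sorted(doc_ids), 2):
            if docs[a][0] != docs[b][0]:   # assumption: skip same-language pairs
                pairs.add((a, b))
    return pairs

def score(a, b, forward):
    # Jaccard overlap of scoring features, read from the forward index only.
    union = forward[a] | forward[b]
    return len(forward[a] & forward[b]) / len(union) if union else 0.0

def find_translation_pairs(docs, threshold=0.5):
    forward, inverted = build_indexes(docs)
    pairs = matching_pairs(inverted, docs)
    return sorted(p for p in pairs if score(*p, forward) >= threshold)

# Hypothetical collection: doc_id -> (language, tokens in a shared representation)
docs = {
    "en1": ("en", "the quick brown fox jumps".split()),
    "de1": ("de", "the quick brown fox leaps".split()),
    "fr1": ("fr", "stock markets fell sharply today".split()),
}
translation_pairs = find_translation_pairs(docs)
```

In this toy collection only "de1" and "en1" share a bigram, so they form the only matching document pair; their word overlap (4 of 6 distinct words) clears the assumed threshold.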
Abstract
A technique includes providing a collection of documents in multiple languages, identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares multiple corresponding rare features, evaluating pairs of candidate documents in the group using multiple common features present in the collection of documents, and determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.
20 Claims
1. A computer-implemented method comprising:
extracting, using one or more processors, a plurality of matching features and a plurality of scoring features from a collection of documents in multiple languages;
generating a forward index based on the plurality of scoring features, the forward index comprising one or more scoring feature lists containing at least one scoring feature extracted from the documents in the collection;
generating an inverted index based on the plurality of matching features, the inverted index comprising one or more matching document lists, where each matching document list identifies a group of matching documents from the collection that share a corresponding matching feature;
generating, for each matching document list in the inverted index, a corresponding plurality of matching document pairs;
calculating, for each matching document pair, a score based on information from the forward index; and
determining, based on the score of each matching document pair, whether each matching document pair contains a first matching document and a second matching document that is a translation of the first matching document.
Dependent claims: 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13
7. A method comprising:
providing a collection of documents in multiple languages;
identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares a plurality of corresponding rare features having a low frequency of occurrence in the collection of documents;
evaluating, using one or more processors, pairs of candidate documents in the group using a plurality of common features present in the collection of documents, the common features having a frequency of occurrence in the collection of documents that is higher than the rare features; and
determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.
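The rare/common distinction in claim 7 can be illustrated by counting, for each feature, how many documents in the collection contain it. The sketch below is only one possible reading: the cutoff `max_rare_df`, the requirement of at least two shared rare features (`min_shared`), and all function names are assumptions introduced for illustration, and documents are assumed to be already reduced to feature lists in a shared representation.

```python
from collections import Counter, defaultdict
from itertools import combinations

def document_frequency(docs):
    # Number of documents in which each feature occurs at least once.
    df = Counter()
    for features in docs.values():
        df.update(set(features))
    return df

def split_features(df, max_rare_df=2):
    # Assumption: a feature is "rare" if it occurs in at most max_rare_df documents.
    rare = {f for f, n in df.items() if n <= max_rare_df}
    common = set(df) - rare
    return rare, common

def candidate_pairs(docs, rare, min_shared=2):
    # Group documents by rare feature, then keep pairs that share
    # at least min_shared rare features ("a plurality").
    postings = defaultdict(set)
    for doc_id, features in docs.items():
        for f in set(features) & rare:
            postings[f].add(doc_id)
    shared = Counter()
    for doc_ids in postings.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}

# Hypothetical collection: doc_id -> features in a shared representation
docs = {
    "d1": ["annual", "report", "zurich", "glacier"],
    "d2": ["annual", "report", "zurich", "glacier"],
    "d3": ["annual", "report", "market"],
}
rare, common = split_features(document_frequency(docs))
candidates = candidate_pairs(docs, rare)
```

Here "annual" and "report" occur in all three documents and are treated as common, while "zurich" and "glacier" are rare and jointly single out d1 and d2 as a candidate pair. The common features are then available for evaluating that pair.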
14. A system comprising:
one or more processors and memory operable to interact to perform operations including:
providing a collection of documents in multiple languages;
identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares a plurality of corresponding rare features having a low frequency of occurrence in the collection of documents;
evaluating pairs of candidate documents in the group using a plurality of common features present in the collection of documents, the common features having a frequency of occurrence in the collection of documents that is higher than the rare features; and
determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.
Dependent claims: 15, 16, 17, 18, 19, 20
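The evaluating and determining operations can be sketched as a similarity test restricted to common features. The claims do not name a similarity measure; cosine similarity over feature counts and the 0.8 threshold below are assumptions chosen for illustration, as are the function names and the sample data.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counter objects).
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_translated_pair(tokens_a, tokens_b, common, threshold=0.8):
    # Only common features take part in the evaluation; rare features
    # were used earlier to select the candidates.
    va = Counter(t for t in tokens_a if t in common)
    vb = Counter(t for t in tokens_b if t in common)
    return cosine(va, vb) >= threshold

# Hypothetical data in a shared representation
common = {"annual", "report", "market"}
doc_a = ["annual", "report", "market", "zurich"]
doc_b = ["annual", "report", "market", "glacier"]
doc_c = ["market", "market", "oslo"]

pair_ab = is_translated_pair(doc_a, doc_b, common)
pair_ac = is_translated_pair(doc_a, doc_c, common)
```

With these inputs doc_a and doc_b have identical common-feature counts and pass the assumed threshold, while doc_a and doc_c overlap only on "market" and do not.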
Specification