PARALLEL DOCUMENT MINING

US 20120047172A1
Filed: 08/22/2011
Published: 02/23/2012
Est. Priority Date: 08/23/2010
Status: Abandoned Application

Abstract

A technique includes providing a collection of documents in multiple languages, identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares multiple corresponding rare features, evaluating pairs of candidate documents in the group using multiple common features present in the collection of documents, and determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.

20 Claims

1. A computer-implemented method comprising:
- extracting, using one or more processors, a plurality of matching features and a plurality of scoring features from a collection of documents in multiple languages;
  
  generating a forward index based on the plurality of scoring features, the forward index comprising one or more scoring feature lists containing at least one scoring feature extracted from the documents in the collection;
  
  generating an inverted index based on the plurality of matching features, the inverted index comprising one or more matching document lists, where each matching document list identifies a group of matching documents from the collection that share a corresponding matching feature;
  
  generating, for each matching document list in the inverted index, a corresponding plurality of matching document pairs;
  
  calculating, for each matching document pair, a score based on information from the forward index; and
  
  determining, based on the score of each matching document pair, whether each matching document pair contains a first matching document and a second matching document that is a translation of the first matching document.
- Dependent claims (2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13):
  - 2. The method of claim 1, where the matching features occur less frequently in the collection of documents than the scoring features.
  - 3. The method of claim 1, further comprising translating the collection of documents in multiple languages into a collection of documents in a single language.
  - 4. The method of claim 1, where each one or more scoring feature list is indexed by a different corresponding document in the collection.
  - 5. The method of claim 1, where each matching document list is indexed by the corresponding matching feature.
  - 6. The method of claim 1, where calculating the score based on information from the forward index comprises calculating a cosine similarity between a first scoring feature list corresponding to a first matching document in the matching document pair and a second scoring feature list corresponding to a second matching document in the matching document pair.
  - 8. The method of claim 1, where providing the collection of documents in multiple languages comprises translating one or more of the documents into a single language.
  - 9. The method of claim 1, where each rare feature is a feature likely to occur in at least one translated document and at least one other document in the collection of documents.
  - 10. The method of claim 9, where each common feature is a feature that is more likely to occur in the collection of documents than any one of the rare features in the collection of documents.
  - 11. The method of claim 1, where the plurality of corresponding rare features or the plurality of common features comprises portions of text extracted from the collection of documents.
  - 12. The method of claim 1, where the plurality of corresponding rare features or the plurality of common features comprises a plurality of n-grams.
  - 13. The method of claim 1, where evaluating the pairs of candidate documents includes scoring each pair of candidate documents based on at least some of the multiple common features to obtain a candidate pair score, and where determining whether each pair of candidate documents corresponds to a translated pair of documents includes discarding one or more pairs of candidate documents having a candidate pair score below a threshold value.
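The steps recited in claim 1, together with the cosine-similarity scoring of claim 6, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicants' implementation: the function names, the use of word n-grams as features (longer n-grams for matching, single words for scoring), and the premise that the documents have already been translated into a single language (claim 3) are all choices made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def ngrams(text, n):
    """Return the word n-grams of text as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_indexes(docs, match_n=3, score_n=1):
    """Build both indexes; docs maps doc id -> text (assumed single-language).

    Assumption for this sketch: n-grams of length match_n act as matching
    features and n-grams of length score_n act as scoring features.
    """
    forward = {}                 # doc id -> counts of scoring features
    inverted = defaultdict(set)  # matching feature -> ids of documents sharing it
    for doc_id, text in docs.items():
        forward[doc_id] = Counter(ngrams(text, score_n))
        for feature in ngrams(text, match_n):
            inverted[feature].add(doc_id)
    return forward, inverted


def candidate_pairs(inverted):
    """Generate every document pair that shares at least one matching feature."""
    pairs = set()
    for doc_ids in inverted.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


def cosine(a, b):
    """Cosine similarity between two scoring feature lists (Counters)."""
    dot = sum(count * b[feature] for feature, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def score_pairs(docs):
    """Score each candidate pair using only information from the forward index."""
    forward, inverted = build_indexes(docs)
    return {
        (first, second): cosine(forward[first], forward[second])
        for first, second in candidate_pairs(inverted)
    }


docs = {
    "a": "the quick brown fox jumps",
    "b": "the quick brown fox sleeps",
    "c": "an unrelated sentence entirely",
}
scores = score_pairs(docs)
```

In this toy collection only documents "a" and "b" share a matching feature, so they form the single candidate pair; document "c" is never scored. Restricting scoring to pairs drawn from the inverted index is what keeps the number of comparisons well below all-pairs.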

7. A method comprising:
- providing a collection of documents in multiple languages;
  
  identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares a plurality of corresponding rare features having a low frequency of occurrence in the collection of documents;
  
  evaluating, using one or more processors, pairs of candidate documents in the group using a plurality of common features present in the collection of documents, the common features having a frequency of occurrence in the collection of documents that is higher than the rare features; and
  
  determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.
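
Claim 7 separates rare features from common features by their frequency of occurrence in the collection. One way to picture that split is by document frequency, as in the sketch below. The function name and the two cut-off parameters are hypothetical; the claim only requires that common features occur more frequently than rare ones and does not give numeric thresholds. Skipping features seen in a single document is also an assumption of this example, made because such a feature cannot link two documents.

```python
from collections import Counter


def split_features_by_frequency(doc_features, rare_max_df, common_min_df):
    """Split features into rare and common sets by document frequency.

    doc_features maps doc id -> iterable of features found in that document.
    rare_max_df and common_min_df are illustrative cut-offs, not values
    taken from the application.
    """
    if rare_max_df >= common_min_df:
        raise ValueError("common features must be more frequent than rare ones")
    doc_freq = Counter()
    for features in doc_features.values():
        doc_freq.update(set(features))  # count each feature once per document
    # A feature seen in only one document cannot be shared, so it is skipped.
    rare = {f for f, n in doc_freq.items() if 2 <= n <= rare_max_df}
    common = {f for f, n in doc_freq.items() if n >= common_min_df}
    return rare, common


doc_features = {
    "d1": {"x", "the"},
    "d2": {"x", "the"},
    "d3": {"the", "y"},
    "d4": {"the"},
}
rare, common = split_features_by_frequency(doc_features, rare_max_df=2, common_min_df=3)
```

Here "x" appears in two documents and lands in the rare set, "the" appears in all four and lands in the common set, and "y" appears only once and is in neither.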

14. A system comprising:
- one or more processors and memory operable to interact to perform operations including:
  
  providing a collection of documents in multiple languages;
  
  identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares a plurality of corresponding rare features having a low frequency of occurrence in the collection of documents;
  
  evaluating pairs of candidate documents in the group using a plurality of common features present in the collection of documents, the common features having a frequency of occurrence in the collection of documents that is higher than the rare features; and
  
  determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The system of claim 14, where providing the collection of documents further comprises translating one or more of the documents in multiple languages into a single language.
  - 16. The system of claim 14, where each rare feature is a feature likely to occur in at least one translated document and at least one other document in the collection of documents.
  - 17. The method of claim 16, where each common feature is a feature that is more likely to occur in the collection of documents than any one of the rare features in the collection of documents.
  - 18. The system of claim 14, where the plurality of corresponding rare features or the plurality of common features comprises portions of text extracted from the collection of documents.
  - 19. The system of claim 14, where the plurality of corresponding rare features or the plurality of common features comprises a plurality of n-grams.
  - 20. The system of claim 14, where evaluating the pairs of candidate documents comprises scoring each pair of candidate documents based on at least some of the common features to obtain a candidate pair score, and where determining whether each pair of candidate documents corresponds to a translated pair of documents includes discarding one or more pairs of candidate documents having a candidate pair score below a threshold value.
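
The thresholding recited in claims 13 and 20 amounts to dropping candidate pairs whose score falls below a cut-off. A minimal sketch follows, assuming the candidate pair scores are already available as a dictionary; the function name, the example scores, and the threshold value of 0.5 are illustrative and do not come from the application.

```python
def filter_pairs(pair_scores, threshold):
    """Discard candidate pairs whose score is below threshold.

    pair_scores maps (first_doc_id, second_doc_id) -> candidate pair score.
    """
    return {pair: score for pair, score in pair_scores.items() if score >= threshold}


# Hypothetical scores for two candidate pairs.
kept = filter_pairs({("a", "b"): 0.8, ("a", "c"): 0.1}, threshold=0.5)
```

Only the pair ("a", "b") survives; the pair scoring 0.1 is discarded as unlikely to be a translated pair.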

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Ponte, Jay M.; Popat, Ashok C.; Uszkoreit, Jakob; Dubiner, Moshe
Application Number: US13/214,941
Publication Number: US 20120047172A1
US Class Current: 707/776
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
G06F 40/45 Example-based machine trans...
