PARALLEL DOCUMENT MINING

  • US 20120047172A1
  • Filed: 08/22/2011
  • Published: 02/23/2012
  • Est. Priority Date: 08/23/2010
  • Status: Abandoned Application
First Claim

1. A computer-implemented method comprising:

    extracting, using one or more processors, a plurality of matching features and a plurality of scoring features from a collection of documents in multiple languages;

    generating a forward index based on the plurality of scoring features, the forward index comprising one or more scoring feature lists containing at least one scoring feature extracted from the documents in the collection;

    generating an inverted index based on the plurality of matching features, the inverted index comprising one or more matching document lists, where each matching document list identifies a group of matching documents from the collection that share a corresponding matching feature;

    generating, for each matching document list in the inverted index, a corresponding plurality of matching document pairs;

    calculating, for each matching document pair, a score based on information from the forward index; and

    determining, based on the score of each matching document pair, whether each matching document pair contains a first matching document and a second matching document that is a translation of the first matching document.
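
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the claim does not say which features are used, how the score is computed, or what threshold applies. As assumptions for illustration only, the sketch treats numeric tokens as matching features, uses the set of numeric tokens plus the token count as scoring features, scores a pair as the Jaccard overlap of its numeric tokens multiplied by its length ratio, and only pairs documents whose language tags differ. The names `extract_features`, `mine_parallel_pairs`, and `threshold` are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def extract_features(text):
    """Return (matching_features, scoring_features) for one document.

    Assumption: numeric tokens are the matching features; the same
    numeric tokens plus the token count are the scoring features.
    """
    numbers = set(re.findall(r"\d+", text))
    scoring = {"numbers": numbers, "length": len(text.split())}
    return numbers, scoring


def mine_parallel_pairs(docs, threshold=0.5):
    """docs maps doc_id -> (language, text).

    Returns {(id_a, id_b): (score, is_translation)} for every candidate pair.
    """
    forward_index = {}                  # doc_id -> scoring features
    inverted_index = defaultdict(list)  # matching feature -> doc_ids sharing it

    for doc_id, (_lang, text) in docs.items():
        matching, scoring = extract_features(text)
        forward_index[doc_id] = scoring
        for feature in matching:
            inverted_index[feature].append(doc_id)

    # Generate document pairs from each matching document list.
    # Assumption: only pairs in different languages are kept.
    candidates = set()
    for doc_ids in inverted_index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            if docs[a][0] != docs[b][0]:
                candidates.add((a, b))

    # Score each pair using only information from the forward index.
    results = {}
    for a, b in candidates:
        fa, fb = forward_index[a], forward_index[b]
        union = fa["numbers"] | fb["numbers"]
        jaccard = len(fa["numbers"] & fb["numbers"]) / len(union) if union else 0.0
        longest = max(fa["length"], fb["length"])
        length_ratio = min(fa["length"], fb["length"]) / longest if longest else 0.0
        score = jaccard * length_ratio
        results[(a, b)] = (score, score >= threshold)
    return results
```

The inverted index limits scoring to documents that share at least one matching feature, which avoids comparing every document with every other one; the forward index then supplies the per-document data needed to score each candidate pair without re-reading the documents.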
