Phrase-based detection of duplicate documents in an information retrieval system

  • US 8,108,412 B2
  • Filed: 03/04/2010
  • Issued: 01/31/2012
  • Est. Priority Date: 07/26/2004
  • Status: Active Grant
First Claim

1. A method of detecting duplicate documents in search results, the method comprising:

    receiving a query comprising at least one phrase;

    retrieving a plurality of documents responsive to the query to form a search result, the retrieved documents being selected from a set of documents;

    for each of the retrieved documents, generating, by operation of a processor within a computer system, a document description comprising selected sentences of the document, wherein the selected sentences are ordered in the document description as a function of a number of related phrases in each selected sentence, wherein a phrase g_j is a related phrase of another phrase g_k occurring in the set of documents when an information gain of g_j with respect to g_k exceeds a predetermined threshold;

    responsive to the document description of at least two documents matching, discarding at least one of the two documents from the search result.
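
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Several things are assumptions made only for illustration: the information-gain values are taken as a precomputed input (the claim does not say how they are calculated), the phrases counted in each sentence are those related to the query phrases, phrase matching is a naive lowercase substring test, and the names `related_phrases`, `describe`, `deduplicate`, and `top_n` are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def related_phrases(
    info_gain: Dict[Tuple[str, str], float], threshold: float
) -> Dict[str, Set[str]]:
    """Map each phrase gk to the phrases gj whose information gain
    with respect to gk exceeds the threshold.

    Assumption: info_gain[(gj, gk)] is supplied by the caller.
    """
    related: Dict[str, Set[str]] = {}
    for (gj, gk), gain in info_gain.items():
        if gain > threshold:
            related.setdefault(gk, set()).add(gj)
    return related


def describe(document: str, candidates: Set[str], top_n: int = 2) -> Tuple[str, ...]:
    """Build a document description: the top_n sentences, ordered by how
    many candidate related phrases each sentence contains (most first)."""
    # Naive sentence splitter (assumption): break after ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

    def score(sentence: str) -> int:
        text = sentence.lower()
        return sum(1 for phrase in candidates if phrase in text)

    # sorted() is stable, so ties keep their original document order.
    ranked = sorted(sentences, key=score, reverse=True)
    return tuple(ranked[:top_n])


def deduplicate(
    results: List[str],
    query_phrases: List[str],
    related: Dict[str, Set[str]],
    top_n: int = 2,
) -> List[str]:
    """Keep the first document for each distinct description and discard
    later documents whose description matches one already seen."""
    candidates: Set[str] = set()
    for phrase in query_phrases:
        candidates |= related.get(phrase, set())

    seen: Set[Tuple[str, ...]] = set()
    kept: List[str] = []
    for doc in results:
        description = describe(doc, candidates, top_n)
        if description in seen:
            continue  # matching description: treat as a duplicate
        seen.add(description)
        kept.append(doc)
    return kept
```

In this sketch, two documents that differ only in trailing boilerplate produce the same description, so only the first is kept. Comparing short descriptions rather than full text is the design point the claim turns on: the comparison is limited to the sentences richest in related phrases.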
