Phrase-based detection of duplicate documents in an information retrieval system

  • US 8,489,628 B2
  • Filed: 12/01/2011
  • Issued: 07/16/2013
  • Est. Priority Date: 07/26/2004
  • Status: Active Grant
First Claim

1. A computer-implemented method of selecting documents from a document collection in response to a query, the method comprising:

  • receiving a query;

    identifying, by at least one processor of a computing system, an incomplete phrase in the query, wherein phrases predicted by the incomplete phrase in the document collection include only phrase extensions of the incomplete phrase;

    identifying, by at least one processor of the computing system, a phrase extension of the incomplete phrase, wherein the phrase extension of the incomplete phrase is a sequence of words that begins with the incomplete phrase but that is longer than the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on an actual co-occurrence rate of the phrase extension and the incomplete phrase in the document collection exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection; and

    selecting, by at least one processor of the computing system, documents from the document collection, wherein the selected documents contain the phrase extension.
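
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`contains`, `select_documents`), the whitespace tokenization, the document-level notion of co-occurrence, the independence-based estimate of the expected rate, and the `threshold` parameter are all assumptions made for illustration. The claim only requires that the actual co-occurrence rate exceed the expected one.

```python
def contains(tokens, phrase):
    """Return True if the word list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def select_documents(docs, incomplete, candidates, threshold=1.0):
    """Map each predicted phrase extension to the indices of documents containing it.

    Hypothetical sketch: co-occurrence is counted per document, and the
    expected count assumes the two phrases occur independently.
    """
    tokenized = [d.lower().split() for d in docs]
    total = len(tokenized)
    base = incomplete.lower().split()
    base_count = sum(contains(t, base) for t in tokenized)

    selected = {}
    for candidate in candidates:
        ext = candidate.lower().split()
        # An extension must begin with the incomplete phrase and be longer.
        if len(ext) <= len(base) or ext[:len(base)] != base:
            continue
        ext_docs = [i for i, t in enumerate(tokenized) if contains(t, ext)]
        # Every document containing the extension also contains the base,
        # so the actual co-occurrence count equals the extension count.
        actual = len(ext_docs)
        expected = base_count * len(ext_docs) / total if total else 0.0
        if actual > threshold * expected:
            selected[candidate] = ext_docs
    return selected


docs = [
    "the president of the united states spoke today",
    "the president of the united states visited",
    "stock markets fell sharply",
    "rain is expected tomorrow",
]
result = select_documents(
    docs,
    "president of the",
    ["president of the united states", "president of the moon", "markets fell"],
)
print(result)
```

In this toy collection, "president of the united states" appears in two documents while independence would predict one co-occurrence, so it is treated as a predicted extension and its documents are selected. The other two candidates are rejected, one because it never occurs and one because it does not begin with the incomplete phrase. A stricter `threshold` above 1.0 would demand a stronger prediction signal.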
