Bifurcated document relevance scoring

  • US 8,086,594 B1
  • Filed: 03/30/2007
  • Issued: 12/27/2011
  • Est. Priority Date: 03/30/2007
  • Status: Active Grant
First Claim

1. A computer implemented method for bifurcated document relevance scoring of documents in a document collection, the method comprising:

  • indexing a plurality of documents in the document collection by:

    providing a set of phrases;

    for a plurality of documents in the document collection:

    identifying a plurality of phrases from the set of phrases that occurs in the document;

    for each phrase in a plurality of the identified phrases, scoring the phrase to produce a phrase relevance score for the phrase with respect to the document, and storing the phrase relevance score for the document in a phrase posting list for the phrase;

    receiving a search query of three or more words;

    determining a set of valid phrases in the search query by:

    decomposing, by at least one processor of a computer system, the query into a plurality of candidate phrasifications, including different groupings of words of the query, each candidate phrasification comprising a disjoint union of component phrases, and each component phrase including at least one word or related word of the query;

    scoring, by at least one of the processors of the computer system, at least two candidate phrasifications, wherein the candidate phrasifications include one or more component phrases, wherein the scoring is based on a probability of occurrence of each of the candidate phrasification's component phrases in a corpus of documents and is based on the number of component phrases constituting the candidate phrasification, wherein candidate phrasifications having relatively fewer component phrases are weighted higher than candidate phrasifications having relatively more component phrases;

    comparing, by at least one of the processors of the computer system, a score for each scored candidate phrasification to a threshold value; and

    selecting, by at least one of the processors of the computer system, at least one candidate phrasification, wherein the scores of each selected candidate phrasification exceed a threshold value and identifying the component phrase(s) of the selected candidate phrasification(s) as valid phrases for the search query;

    for each valid phrase for the search query, obtaining from the phrase posting list for the valid phrase the phrase relevance score for documents in which the valid phrase occurs; and

    for documents in which a valid phrase of the query occurs, scoring the document to produce a final relevance score using the phrase relevance scores for the document and based on the valid phrases of the search query.
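
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not fix any formulas, so every function name, the occurrence-count phrase score, the document-frequency probability, the geometric-mean-divided-by-count phrasification score, the restriction to contiguous word groupings, and the summed final score are assumptions chosen only to satisfy the stated constraints (scoring depends on component-phrase probability, and fewer components are weighted higher).

```python
import math
from collections import defaultdict


def count_phrase(tokens, phrase):
    """Count occurrences of a phrase (space-separated words) in a token list."""
    target = phrase.split()
    n = len(target)
    return sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def build_posting_lists(docs, phrases):
    """Indexing: phrase -> {doc_id: phrase relevance score}.
    Assumption: the phrase relevance score is a raw occurrence count."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for phrase in phrases:
            count = count_phrase(tokens, phrase)
            if count:
                postings[phrase][doc_id] = float(count)
    return postings


def phrasifications(words):
    """Yield every split of the query words into contiguous component phrases.
    Assumption: only contiguous groupings are considered."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in phrasifications(words[i:]):
            yield [head] + rest


def score_phrasification(components, postings, num_docs):
    """Assumed formula: geometric mean of the component probabilities, divided
    by the component count so that fewer components score higher."""
    probs = [len(postings.get(p, {})) / num_docs for p in components]
    k = len(components)
    return math.prod(probs) ** (1.0 / k) / k


def valid_phrases(query, postings, num_docs, threshold):
    """Collect component phrases of each phrasification scoring above threshold."""
    valid = set()
    for components in phrasifications(query.lower().split()):
        if score_phrasification(components, postings, num_docs) > threshold:
            valid.update(components)
    return valid


def final_scores(valid, postings):
    """Assumption: final relevance is the sum of the stored phrase scores."""
    scores = defaultdict(float)
    for phrase in valid:
        for doc_id, score in postings.get(phrase, {}).items():
            scores[doc_id] += score
    return dict(scores)


# Hypothetical toy collection and phrase set, for illustration only.
docs = {1: "new york pizza is great", 2: "pizza in new york", 3: "new shoes"}
phrases = {"new", "york", "pizza", "new york", "new york pizza"}

postings = build_posting_lists(docs, phrases)
valid = valid_phrases("new york pizza", postings, len(docs), threshold=0.3)
print(sorted(valid))
print(final_scores(valid, postings))
```

In this toy run the three-word query yields four candidate phrasifications. The all-single-word grouping falls below the assumed threshold of 0.3, so only the component phrases of the coarser groupings are treated as valid, and document 3 (which contains none of them) receives no final score.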
