Phrase extraction using subphrase scoring

  • US 8,402,033 B1
  • Filed: 10/14/2011
  • Issued: 03/19/2013
  • Est. Priority Date: 03/30/2007
  • Status: Active Grant
First Claim

1. A computer implemented method of extracting a set of valid phrases from a plurality of documents, the method comprising:

  • for each document:

    identifying a plurality of candidate phrases contained in the document, wherein a candidate phrase includes multiple consecutive words that appear in the document;

    scoring candidate phrases in the document to produce document phrase scores for the candidate phrases for the document, the document phrase scores for a candidate phrase being based on instances of the candidate phrase that appear in the document, wherein scoring a candidate phrase in the document to produce a document phrase score includes:

    determining for the candidate phrase two or more subphrases within the candidate phrase, wherein a subphrase contains two or more words; and

    scoring each determined subphrase in the document as a function of the position of the subphrase relative to a sequence of words containing the candidate phrase;

    for at least one of the candidate phrases:

    creating, via a processor, a combined score for the candidate phrase based on a plurality of different document phrase scores for the candidate phrase for respective different documents; and

    determining whether the candidate phrase is a valid phrase based on the combined score for the candidate phrase and based on the document phrase scores for the candidate phrase.
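
The steps recited in claim 1 can be illustrated with a small Python sketch. This is not the patented implementation: the claim does not specify a scoring formula, so the position weight `1 / (1 + offset)`, the candidate lengths (`min_len`, `max_len`), the summation used for the combined score, and the thresholds (`min_combined`, `min_doc_score`, `min_docs`) are all assumptions chosen only to make the flow concrete. The function names are hypothetical as well.

```python
from collections import defaultdict


def subphrases(phrase):
    """Yield (offset, subphrase) for each contiguous run of 2+ words.

    The full phrase itself is included (offset 0), which is a choice made
    for this sketch.
    """
    n = len(phrase)
    for offset in range(n - 1):
        for end in range(offset + 2, n + 1):
            yield offset, phrase[offset:end]


def score_document(text, min_len=3, max_len=4):
    """Return {candidate phrase: document phrase score} for one document.

    Candidates are runs of min_len..max_len consecutive words, so each has
    at least two subphrases. Every instance of a candidate adds the sum of
    its subphrase scores; a subphrase is scored by its position (offset)
    inside the candidate. The 1 / (1 + offset) weight is an assumption.
    """
    words = text.lower().split()
    scores = defaultdict(float)
    for start in range(len(words)):
        for length in range(min_len, max_len + 1):
            if start + length > len(words):
                break
            candidate = tuple(words[start:start + length])
            for offset, _sub in subphrases(candidate):
                scores[candidate] += 1.0 / (1 + offset)
    return dict(scores)


def extract_valid_phrases(documents, min_combined=4.0,
                          min_doc_score=2.0, min_docs=2):
    """Combine per-document scores and decide which phrases are valid.

    Validity here depends on both the combined score (sum over documents)
    and the individual document phrase scores (enough documents must each
    reach min_doc_score). All thresholds are illustrative assumptions.
    """
    by_phrase = defaultdict(list)
    for doc in documents:
        for phrase, score in score_document(doc).items():
            by_phrase[phrase].append(score)

    valid = set()
    for phrase, doc_scores in by_phrase.items():
        combined = sum(doc_scores)
        supporting = sum(1 for s in doc_scores if s >= min_doc_score)
        if combined >= min_combined and supporting >= min_docs:
            valid.add(" ".join(phrase))
    return valid


docs = [
    "New York City is large",
    "I love New York City",
    "the city is quiet",
]
print(extract_valid_phrases(docs))
```

With these sample documents, only "new york city" is reported: it appears in two documents, while every other candidate appears in just one and so fails the per-document support check even when its combined score is high enough.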
