
Phrase identification in an information retrieval system

  • US 7,580,921 B2
  • Filed: 07/26/2004
  • Issued: 08/25/2009
  • Est. Priority Date: 07/26/2004
  • Status: Active Grant
First Claim

1. A computer implemented method for identifying phrases in a document collection, the method comprising:

    collecting possible phrases from documents in the document collection;

    classifying individual possible phrases as either a good phrase or a bad phrase according to a frequency of occurrence of the individual possible phrase;

    determining, for a pair of good phrases gj and gk in the document collection, an information gain of gk with respect to gj as a function of an actual co-occurrence rate of gj and gk and an expected co-occurrence rate of gj and gk in the document collection;

    selectively retaining as valid phrases those good phrases that predict the occurrence of at least one other good phrase in the document collection, where a good phrase gj predicts the occurrence of another good phrase gk in the document collection when the determined information gain of gk in the presence of gj exceeds a first predetermined threshold;

    identifying, for a plurality of selectively retained valid phrases gx, a phrase gy as a related phrase of gx where the information gain of gy in the presence of gx exceeds a second predetermined threshold that is more restrictive than the first predetermined threshold; and

    storing the valid phrases and identified related phrases on a computer-readable storage medium.
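
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; several details are assumptions made only for illustration. The claim says only that information gain is "a function of" the actual and expected co-occurrence rates, so the sketch assumes their ratio, and assumes that the expected rate is the product of the two phrases' individual document rates. The function name `identify_phrases`, the parameters `min_doc_freq`, `predict_threshold`, and `related_threshold`, their values, and the sample documents are all hypothetical. The final storing step is left out; the returned structures could be written to disk with any serializer.

```python
from collections import Counter
from itertools import permutations


def identify_phrases(docs, min_doc_freq=2, predict_threshold=1.5,
                     related_threshold=1.9):
    """Sketch of the claimed steps. `docs` is a list of sets, each holding
    the possible phrases already collected from one document."""
    if related_threshold <= predict_threshold:
        raise ValueError("the second threshold must be more restrictive")
    n = len(docs)

    # Classify possible phrases as good or bad by frequency of occurrence
    # (assumed here: number of documents containing the phrase).
    doc_freq = Counter(p for d in docs for p in d)
    good = {p for p, count in doc_freq.items() if count >= min_doc_freq}

    # Count documents in which each ordered pair of good phrases co-occurs.
    co = Counter()
    for d in docs:
        for gj, gk in permutations(sorted(d & good), 2):
            co[(gj, gk)] += 1

    # Information gain of gk given gj, ASSUMED to be actual / expected:
    #   actual   = co / n
    #   expected = (df_j / n) * (df_k / n)
    # which simplifies to co * n / (df_j * df_k).
    gains = {
        (gj, gk): count * n / (doc_freq[gj] * doc_freq[gk])
        for (gj, gk), count in co.items()
    }

    # Retain good phrases that predict at least one other good phrase.
    valid = {gj for (gj, _), g in gains.items() if g > predict_threshold}

    # Related phrases must clear the stricter second threshold.
    related = {gx: [] for gx in valid}
    for (gx, gy), g in gains.items():
        if gx in valid and g > related_threshold:
            related[gx].append(gy)
    return valid, {gx: sorted(gys) for gx, gys in related.items()}


# Hypothetical sample collection, one set of possible phrases per document.
docs = [
    {"white house", "president", "click here"},
    {"white house", "president"},
    {"hot dog", "mustard", "click here"},
    {"hot dog", "mustard"},
    {"white house", "hot dog"},
    {"click here", "rare phrase"},
]
valid, related = identify_phrases(docs)
```

In this sample, "rare phrase" is classified as bad because it occurs in only one document, and "click here" is good but not retained because it co-occurs with other phrases no more often than chance would suggest.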
