Phrase identification in an information retrieval system
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
25 Claims
1. A computer implemented method for identifying phrases in a document collection, the method comprising:
collecting possible phrases from documents in the document collection;
classifying individual possible phrases as either a good phrase or a bad phrase according to a frequency of occurrence of the individual possible phrase;
determining, for a pair of good phrases gj and gk in the document collection, an information gain of gk with respect to gj as a function of an actual co-occurrence rate of gj and gk and an expected co-occurrence rate of gj and gk in the document collection;
selectively retaining as valid phrases those good phrases that predict the occurrence of at least one other good phrase in the document collection, where a good phrase gj predicts the occurrence of another good phrase gk in the document collection when the determined information gain of gk in the presence of gj exceeds a first predetermined threshold;
identifying, for a plurality of selectively retained valid phrases gx, a phrase gy as a related phrase of gx where the information gain of gy in the presence of gx exceeds a second predetermined threshold that is more restrictive than the first predetermined threshold; and
storing the valid phrases and identified related phrases on a computer-readable storage medium.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
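The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim only requires that the information gain be "a function of" the actual and expected co-occurrence rates; the sketch assumes the simple ratio of the two, measured per document, as one function that fits that wording. The function names, the word n-gram collection step, the document-count criterion for a "good" phrase, and the threshold values (`min_docs=2`, `keep_threshold=1.5`, `related_threshold=2.0`) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def collect_phrases(docs, max_len=2):
    """Collect possible phrases: map each word n-gram to the ids of documents containing it."""
    containing = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                containing[" ".join(words[i:i + n])].add(doc_id)
    return containing


def information_gain(containing, n_docs, gj, gk):
    """Assumed form: actual co-occurrence rate divided by expected co-occurrence rate."""
    actual = len(containing[gj] & containing[gk]) / n_docs
    expected = (len(containing[gj]) / n_docs) * (len(containing[gk]) / n_docs)
    return actual / expected


def identify_phrases(docs, min_docs=2, keep_threshold=1.5, related_threshold=2.0):
    """Return (valid phrases, related phrases per valid phrase)."""
    n_docs = len(docs)
    containing = collect_phrases(docs)
    # Good phrases: candidates that occur in at least min_docs documents.
    good = sorted(p for p, ids in containing.items() if len(ids) >= min_docs)
    valid, related = set(), defaultdict(set)
    for gj in good:
        for gk in good:
            if gj == gk:
                continue
            gain = information_gain(containing, n_docs, gj, gk)
            if gain > keep_threshold:      # gj predicts gk, so gj is retained
                valid.add(gj)
            if gain > related_threshold:   # stricter test for related phrases
                related[gj].add(gk)
    return valid, dict(related)


if __name__ == "__main__":
    docs = [
        "the barack obama white house",
        "the barack obama white house",
        "the green tea recipe",
        "the green tea recipe",
        "the weather report",
    ]
    valid, related = identify_phrases(docs)
    print(sorted(valid))
    print(sorted(related["barack obama"]))
```

In this toy collection, "the" is frequent enough to be a good phrase but co-occurs with every other phrase exactly as often as chance predicts, so it predicts nothing and is not retained as valid. "weather report" appears in only one document and is classified as bad before any information gain is computed.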
21. A computer program product, for identifying phrases in a document collection, comprising computer operable instructions stored on a computer-readable storage medium and configured to control a processor to perform the operation of:
collecting possible phrases from documents in the document collection;
classifying individual possible phrases as either a good phrase or a bad phrase according to a frequency of occurrence of the possible phrase;
determining, for a pair of good phrases gj and gk in the document collection, an information gain of gk with respect to gj as a function of an actual co-occurrence rate of gj and gk and an expected co-occurrence rate of gj and gk in the document collection;
selectively retaining as valid phrases those good phrases that predict the occurrence of at least one other good phrase in the document collection, where a good phrase gj predicts the occurrence of another good phrase gk in the document collection when the determined information gain of gk in the presence of gj exceeds a first predetermined threshold;
identifying, for a plurality of selectively retained valid phrases gx, a phrase gy as a related phrase of gx where the information gain of gy in the presence of gx exceeds a second predetermined threshold that is more restrictive than the first predetermined threshold; and
storing the selectively retained good phrases on a computer-readable storage medium.
Dependent claims: 22
23. A computer implemented method for indexing documents in a document collection, the method comprising:
collecting possible phrases from documents in the document collection;
classifying each possible phrase as either a good phrase or a bad phrase according to a frequency of occurrence of the possible phrase;
determining, for a pair of good phrases gj and gk in the document collection, an information gain of gk with respect to gj as a function of an actual co-occurrence rate of gj and gk and an expected co-occurrence rate of gj and gk in the document collection;
storing as valid phrases, on a computer-readable storage medium, only good phrases that predict the occurrence of at least one other good phrase in the document collection where a good phrase gj predicts the occurrence of another good phrase gk in the document collection when the determined information gain of gk in the presence of gj exceeds a first predetermined threshold;
identifying, for a plurality of selectively retained valid phrases gx, a phrase gy as a related phrase of gx where the information gain of gy in the presence of gx exceeds a second predetermined threshold that is more restrictive than the first predetermined threshold; and
indexing documents in the document collection with respect to the valid phrases and the related phrases.
Dependent claims: 24, 25
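The final step of claim 23, indexing documents with respect to the valid phrases and the related phrases, might look roughly like the following Python sketch. The claim does not specify an index layout, so the structure here is an assumption: each valid phrase maps to the documents containing it, and each entry also records which related phrases appear in the same document. The function name `index_documents` and the sample phrase sets are hypothetical, and the valid and related phrases are assumed to have been computed by the earlier steps of the claim.

```python
def index_documents(docs, valid, related):
    """Return {phrase: {doc_id: sorted related phrases also present in that document}}."""
    index = {}
    for doc_id, text in enumerate(docs):
        # Normalise whitespace and pad so phrases match only on whole-word boundaries.
        padded = f" {' '.join(text.lower().split())} "
        present = {p for p in valid if f" {p} " in padded}
        for phrase in present:
            also = sorted(present & related.get(phrase, set()))
            index.setdefault(phrase, {})[doc_id] = also
    return index


if __name__ == "__main__":
    docs = [
        "Barack Obama visits the White House",
        "white house tour",
        "green tea",
    ]
    valid = {"barack obama", "white house", "green tea"}
    related = {
        "barack obama": {"white house"},
        "white house": {"barack obama"},
    }
    for phrase, postings in sorted(index_documents(docs, valid, related).items()):
        print(phrase, postings)
```

Recording the co-present related phrases alongside each entry is one way to let a retrieval step see, without rescanning the document, how many related phrases accompany a given phrase.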
Specification