Topic identification and use thereof in information retrieval systems
Abstract
A technique to determine topics associated with, or classifications for, a data corpus uses an initial domain-specific word list to identify word combinations (one or more words) that appear in the data corpus significantly more often than expected. Word combinations so identified are selected as topics and associated with a user-specified level of granularity. For example, topics may be associated with each table entry, each image, each sentence, each paragraph, or an entire file. Topics may be used to guide information retrieval and/or the display of topic classifications during user query operations.
38 Claims
1. A method to identify topics in a data corpus having a plurality of segments, comprising:
- determining a segment-level actual usage value for one or more word combinations, wherein a word combination includes two or more substantially contiguous words, wherein two words are substantially contiguous if they are separated by zero words or words selected from a predetermined list of words;
- computing a segment-level expected usage value for each of the one or more word combinations in accordance with $S(w_i) \times S(w_j) \times \cdots \times S(w_m) / N^{m-1}$, where “m” represents the number of words in the word combination, “N” represents the number of segments in the data corpus, and $S(w_i)$ represents the number of unique segments in the data corpus that word $w_i$ of the word combination is in;
- designating a word combination as a topic if the segment-level actual usage value of the word combination is greater than the segment-level expected usage value of the word combination; and
- storing the topic on a computer readable storage medium.
Dependent claims: 2-10 and 19-38.
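The expected-usage formula and the topic test recited in claim 1 can be illustrated with a short Python sketch. This is only one possible reading of the claim language, not the patented implementation: the function names `expected_usage` and `is_topic` and the sample counts are hypothetical and do not appear in the patent.

```python
from math import prod


def expected_usage(segment_counts, n_segments):
    """Return S(w_i) * S(w_j) * ... * S(w_m) / N**(m - 1).

    segment_counts: for each word of the combination, the number of unique
                    segments that contain the word (the S(w) values).
    n_segments:     total number of segments in the corpus (N).
    """
    m = len(segment_counts)
    if m < 2:
        raise ValueError("a word combination has two or more words")
    return prod(segment_counts) / n_segments ** (m - 1)


def is_topic(actual_usage, segment_counts, n_segments):
    """Designate a topic when actual usage exceeds expected usage."""
    return actual_usage > expected_usage(segment_counts, n_segments)


# Hypothetical corpus of N = 1000 segments: one word occurs in 100
# segments and the other in 50, so the expected value is 100 * 50 / 1000.
print(expected_usage([100, 50], 1000))
print(is_topic(20, [100, 50], 1000))
```

With these made-up counts, a two-word combination observed in 20 segments would exceed its expected value and be designated a topic, while one observed in only 3 segments would not.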
11. A program storage device, readable by a programmable control device, comprising instructions stored on the program storage device for causing the programmable control device to identify topics in a data corpus having a plurality of segments, the instructions causing the programmable control device to:
- determine a segment-level actual usage value for one or more word combinations, wherein a word combination includes two or more substantially contiguous words, wherein two words are substantially contiguous if they are separated by zero words or words selected from a predetermined list of words;
- compute a segment-level expected usage value for each of the one or more word combinations in accordance with $S(w_i) \times S(w_j) \times \cdots \times S(w_m) / N^{m-1}$, where “m” represents the number of words in the word combination, “N” represents the number of segments in the data corpus, and $S(w_i)$ represents the number of unique segments in the data corpus that word $w_i$ of the word combination is in; and
- designate a word combination as a topic if the segment-level actual usage value of the word combination is greater than the segment-level expected usage value of the word combination.
Dependent claims: 12-18.
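The "substantially contiguous" test and the segment-level actual usage count recited in the claims above might be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the claim text: actual usage is taken to be the number of unique segments containing the combination, only two-word combinations are handled, words on the predetermined list act purely as separators, and tokenization is simple lowercase whitespace splitting. The names `contiguous_pairs`, `segment_level_actual_usage`, and the sample word list are hypothetical.

```python
from collections import Counter


def contiguous_pairs(words, skip_words):
    """Return the set of two-word combinations in one segment.

    Two words form a pair when they are separated by zero words or only by
    words from skip_words (the "predetermined list"). Assumption: words on
    the list serve only as separators and never belong to a pair.
    """
    pairs = set()
    previous = None
    for word in words:
        if word in skip_words:
            continue  # separator word: keep looking for the next real word
        if previous is not None:
            pairs.add((previous, word))
        previous = word
    return pairs


def segment_level_actual_usage(segments, skip_words):
    """Count, per pair, the number of unique segments that contain it."""
    usage = Counter()
    for segment in segments:
        # A set per segment ensures each segment is counted at most once.
        usage.update(contiguous_pairs(segment.lower().split(), skip_words))
    return usage


segments = [
    "retrieval of information",
    "Retrieval information systems",
    "topic identification",
]
usage = segment_level_actual_usage(segments, skip_words={"of"})
print(usage[("retrieval", "information")])
```

Here "retrieval of information" and "Retrieval information systems" both contribute to the same pair because "of" is on the hypothetical separator list; the resulting counts could then be compared against the expected usage value from the claimed formula.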
Specification