Automatically generating a topic description for text and searching and sorting text by topic using the same
First Claim
1. A method of automatically generating a topical description of text, comprising the steps of:
a) receiving the text, where the text consists of one or more input words;
b) stemming each input word to its root form;
c) assigning a user-definable part-of-speech score $\beta_i$ to each input word;
d) assigning a language salience score $S_i$ to each input word;
e) assigning an input-word score to each input word that is a function of the corresponding input word's part-of-speech score $\beta_i$, language salience score $S_i$, and the number of times the corresponding input word appears in the text;
f) creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word, where each definition word may be further defined to a user-definable number of levels;
g) assigning a definition-word score $A_{i,t}[j]$ to each definition word in each tree structure based on the definition word's part-of-speech score $\beta_j$, the language salience score of the word the definition word defines, a relational salience score $R_{k,j}$, and a user-definable factor $W$;
h) collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains the unique words contained in the corresponding tree structure;
i) assigning a tree-word-list score to each word in each tree-word list, where each tree-word-list score is a function of the scores of the corresponding word that existed in the corresponding uncollapsed tree structure;
j) combining the tree-word lists into a final word list, where the final word list contains the unique words contained in the tree-word lists;
k) assigning a final-word-list score $A_{fi}[j]$ to each word in the final word list, where $A_{fi}[j]$ is a function of the corresponding word's dictionary salience and tree-word-list scores; and
l) choosing the top N scoring words in the final word list as the topic description of the input text, where the value N may be defined by the user.
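Steps f) and l) of the claim can be illustrated with a minimal Python sketch. The toy dictionary, the nested-dict tree representation, and the function names `build_tree` and `top_n` are hypothetical choices made for illustration only; the claim does not prescribe a data structure or a tie-breaking rule.

```python
from typing import Dict, List

# Hypothetical toy dictionary: word -> (already stemmed) definition words.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "dog": ["domestic", "animal"],
    "animal": ["living", "organism"],
    "domestic": ["home"],
}


def build_tree(word: str, levels: int, dictionary: Dict[str, List[str]]) -> dict:
    """Step f: expand `word` into its definition words, `levels` deep."""
    if levels == 0:
        return {"word": word, "children": []}
    children = [
        build_tree(w, levels - 1, dictionary) for w in dictionary.get(word, [])
    ]
    return {"word": word, "children": children}


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Step l: the N highest-scoring words (ties broken alphabetically,
    which is an assumption; the claim does not address ties)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

Here `levels` plays the role of the user-definable number of definition levels, and `n` the user-definable N.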
Abstract
A method of automatically generating a topical description of text by receiving the text containing input words; stemming each input word to its root form; assigning a user-definable part-of-speech score to each input word; assigning a language salience score to each input word; assigning an input-word score to each input word; creating a tree structure under each input word, where each tree structure contains the definition of the corresponding input word; assigning a definition-word score to each definition word; collapsing each tree structure to a corresponding tree-word list; assigning a tree-word-list score to each entry in each tree-word list; combining the tree-word lists into a final word list; assigning each word in the final word list a final-word-list score; and choosing the top N scoring words in the final word list as the topic description of the input text. Document searching and sorting may be accomplished by performing the method described above on each document in a database and then comparing the similarity of the resulting topical descriptions.
390 Citations
31 Claims
-
15. The method of claim 1, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
-
16. The method of claim 1, further comprising the steps of:
-
a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and
d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.
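The comparison in steps c) and d) could be sketched as follows. The claim does not define "sufficiently similar", so the Jaccard overlap of topic-word sets and the `threshold` parameter used here are assumptions, as are the function names.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two topic descriptions, treated as word sets (assumed metric)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_documents(
    topics: Dict[str, Set[str]], doc_of_interest: str, threshold: float = 0.5
) -> List[str]:
    """Return the documents whose topic description meets the similarity threshold."""
    target = topics[doc_of_interest]
    return [
        doc
        for doc, words in topics.items()
        if doc != doc_of_interest and jaccard(words, target) >= threshold
    ]
```

The `topics` mapping is assumed to hold the top-N words already computed for each document.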
-
-
17. The method of claim 1, further comprising the steps of:
-
a) receiving a plurality of documents;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and
d) sorting said plurality of documents by topic description.
-
-
18. The method of claim 2, wherein said step of assigning a language salience score $S_i$ to each input word is comprised of the step of determining the language salience score for each input word from the frequency count $f_i$ of each word in a large corpus of text as follows:
$S_i = 0$, if $f_i > f_{\max}$;
$S_i = \log(f_{\max} / (f_i - T^2 + T))$, if $T^2 < f_i < f_{\max}$;
$S_i = \log(f_{\max} / T)$, if $T < f_i < T^2$; and
$S_i = \varepsilon + ((f_i / T)(\log(f_{\max} / T) - \varepsilon))$, if $f_i \le T$,
where $\varepsilon$ and $T$ are user-definable values, and where $f_{\max}$ represents a point where the sum of frequencies of occurrence above the point equals the sum of frequencies of occurrence below the point.
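The piecewise definition of the language salience score can be sketched in Python. Several points are assumptions rather than claim language: the natural logarithm is used because the claim does not name a base, `T` is taken to be greater than 1 so that the ranges are ordered, and the boundary values `f_i == T**2` and `f_i == f_max`, which the strict inequalities leave unassigned, are folded into the adjacent lower range.

```python
import math


def language_salience(f_i: float, f_max: float, T: float, eps: float) -> float:
    """Language salience S_i from corpus frequency f_i (sketch of claim 18)."""
    if f_i > f_max:
        # Very frequent words carry no salience.
        return 0.0
    if f_i > T * T:
        return math.log(f_max / (f_i - T * T + T))
    if f_i > T:
        # Plateau: constant salience between T and T squared.
        return math.log(f_max / T)
    # Rare words: linear ramp from eps (f_i = 0) up to the plateau (f_i = T).
    return eps + (f_i / T) * (math.log(f_max / T) - eps)
```

Under these assumptions the function is continuous at `f_i = T` and `f_i = T**2`, since the adjacent expressions agree at both points.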
-
-
19. The method of claim 18, wherein said step of assigning a language salience score $S_i$ to each input word further comprises the step of allowing the user to override the language salience score for a particular word with a user-definable language salience score.
-
20. The method of claim 19, wherein said step of assigning an input-word score to each input word is comprised of the step of assigning an input-word score where said input-word score is selected from the group consisting of $m S_i \beta_i$ and $(S_i\, m)\beta_i$, where $m$ is the number of times the corresponding input word occurs in the text.
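The first scoring option named in this claim, the product of the occurrence count, the language salience, and the part-of-speech score, might be computed as in the sketch below. The function name and the dictionary-based inputs are hypothetical, and the second option is not implemented because its exact form is not clear from the claim text as reproduced here.

```python
from collections import Counter
from typing import Dict, List


def input_word_scores(
    stems: List[str], salience: Dict[str, float], pos_score: Dict[str, float]
) -> Dict[str, float]:
    """Score each distinct stemmed input word as m * S_i * beta_i."""
    counts = Counter(stems)  # m: occurrences of each stem in the text
    return {
        word: m * salience[word] * pos_score[word] for word, m in counts.items()
    }
```

The `salience` and `pos_score` mappings are assumed to have been filled in by the earlier steps for every stem in the text.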
-
21. The method of claim 20, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary.
-
22. The method of claim 21, wherein said step of creating a tree structure under each input word is comprised of creating a tree structure under each input word using a recursively closed dictionary that is in a different language than the text.
-
23. The method of claim 22, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:
$A_{i,t}[j] = W(\beta_j, t) \sum_k A_{i,t-1}[k] R_{k,j}$,
where $R_{i,j} = D_j / (\sum_k D_k)$, where $\sum_k D_k$ represents the sum of the dictionary saliences of the words in the definition of word $w_i$, where $D_j = \beta_j (S_j \log(d_{\max} / d_j))^{0.5}$, where $d_j$ is the number of dictionary terms that use the corresponding word in its definition, and where $d_{\max}$ is the number of times the most frequently used word in the dictionary is used.
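A minimal Python sketch of the quantities in this claim follows. It makes several assumptions: natural logarithms, definitions that contain no repeated words and have a nonzero total salience, and a single parent per definition word, so the sum over k reduces to one term. The weight `W` is passed in as a callable because the claim only says it is user-definable; all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def dictionary_salience(beta_j: float, s_j: float, d_j: float, d_max: float) -> float:
    """D_j = beta_j * (S_j * log(d_max / d_j)) ** 0.5"""
    return beta_j * math.sqrt(s_j * math.log(d_max / d_j))


def relational_salience(definition: List[str], D: Dict[str, float]) -> Dict[str, float]:
    """R for each word of one definition: D_j divided by the definition's total."""
    total = sum(D[w] for w in definition)
    return {w: D[w] / total for w in definition}


def definition_scores(
    parent_score: float,
    definition: List[str],
    D: Dict[str, float],
    beta: Dict[str, float],
    t: int,
    W: Callable[[float, int], float],
) -> Dict[str, float]:
    """Scores at level t for the words defining one parent scored at level t-1."""
    R = relational_salience(definition, D)
    return {w: W(beta[w], t) * parent_score * R[w] for w in definition}
```

Because the relational saliences of one definition sum to 1, the parent's score is split among its definition words before `W` scales it.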
-
24. The method of claim 23, wherein said step of assigning a definition-word score to each definition word in each tree structure is comprised of assigning a definition-word score to each definition word as follows:
$A_{i,t}[j] = W(\beta_j, t) \sum_k A_{i,t-1}[k] R_{k,j}$,
where $R_{i,j} = D_j / (\sum_k D_k)$, where $\sum_k D_k$ represents the sum of the dictionary saliences of the words in the definition of word $w_i$, where $D_j = \beta_j (S_j \log(d_m / \Delta_j))^{0.5}$, where $\Delta_j = \max(d_j, \varepsilon)$, and $d_m$ is chosen such that a fixed percentage of the observed values of the $d_j$'s are larger than $d_m$.
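The floor $\Delta_j$ and the percentile-style choice of $d_m$ might look like the sketch below. Two points are assumptions not found in the claim: words whose clamped count exceeds $d_m$ would make the logarithm negative, and the sketch gives them zero salience; and `choose_d_m` picks $d_m$ by rank in the sorted counts, which matches the stated percentage only when there are no ties at that rank. Function names are hypothetical.

```python
import math
from typing import Sequence


def choose_d_m(d_values: Sequence[float], fraction_above: float) -> float:
    """Pick d_m so that roughly `fraction_above` of the counts exceed it."""
    ordered = sorted(d_values, reverse=True)
    k = min(int(fraction_above * len(ordered)), len(ordered) - 1)
    return ordered[k]


def clamped_dictionary_salience(
    beta_j: float, s_j: float, d_j: float, d_m: float, eps: float
) -> float:
    """D_j = beta_j * (S_j * log(d_m / Delta_j)) ** 0.5, Delta_j = max(d_j, eps)."""
    delta_j = max(d_j, eps)  # floor keeps the ratio finite when d_j is 0
    inner = s_j * math.log(d_m / delta_j)
    # Assumption: a non-positive argument is treated as zero salience.
    return beta_j * math.sqrt(inner) if inner > 0 else 0.0
```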
-
25. The method of claim 24, wherein said step of assigning a definition-word score is comprised of the step of assigning a score to each definition word that is user-definable.
-
26. The method of claim 25, wherein said step of collapsing each tree structure is comprised of collapsing each tree structure to a corresponding tree-word list, where each tree-word list contains only salient input words and definition words in a particular tree structure having the highest score while ignoring lower scoring definition words in that tree structure even if the lower scoring definition words score higher than definition words contained in other tree structures.
-
27. The method of claim 26, wherein said step of assigning a tree-word-list score to each word in each tree-word list is comprised of assigning a tree-word-list score that is the sum of the scores associated with the word in its corresponding tree structure.
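Collapsing a scored tree into a tree-word list by summation, as this claim describes, can be sketched as follows. The nested-dict node layout with `word`, `score`, and `children` keys is a hypothetical representation, and the pruning of low-scoring definition words described in claim 26 is not implemented here.

```python
from collections import defaultdict
from typing import Dict


def collapse_tree(tree: dict) -> Dict[str, float]:
    """Sum the scores of every occurrence of each word in one tree."""
    totals: Dict[str, float] = defaultdict(float)
    stack = [tree]
    while stack:
        node = stack.pop()
        totals[node["word"]] += node["score"]
        stack.extend(node.get("children", []))
    return dict(totals)
```

A word that appears at several places in the same tree thus ends up with a single entry whose score is the total of its occurrences.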
28. The method of claim 27, wherein said step of assigning a final word list score is comprised of the step of assigning a final word list score according to the following equation:
$A_{fi}[j] = ((D_j (f(A_i[j]))) \sum A_i[j])$.
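One way to read the final-word-list equation is sketched below. The claim does not define the function f, so it is passed in as a callable over the list of per-tree scores for a word; using `len` (the number of tree-word lists containing the word) in the usage example is purely an assumption. The sum is taken over the tree-word lists in which the word appears, and the function name is hypothetical.

```python
from typing import Callable, Dict, List


def final_scores(
    tree_word_lists: List[Dict[str, float]],
    D: Dict[str, float],
    f: Callable[[List[float]], float],
) -> Dict[str, float]:
    """A_fi[j] = D_j * f(A_i[j]) * sum of A_i[j] over the tree-word lists."""
    per_word: Dict[str, List[float]] = {}
    for scores in tree_word_lists:
        for word, score in scores.items():
            per_word.setdefault(word, []).append(score)
    return {w: D[w] * f(vals) * sum(vals) for w, vals in per_word.items()}


# Usage with the assumed f = number of lists that contain the word.
example = final_scores(
    [{"animal": 3.5, "dog": 4.0}, {"animal": 1.5}],
    {"animal": 2.0, "dog": 1.0},
    len,
)
```

With this choice of f, a word that shows up under several input words is rewarded both through the summed score and through the count.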
-
-
29. The method of claim 28, further comprising the step of translating the topic description into a language different from the input text and the language of the dictionary.
-
30. The method of claim 29, further comprising the steps of:
-
a) receiving a plurality of documents, where one of said plurality of documents is identified as the document of interest;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to the topic description of said document of interest; and
d) returning each of said plurality of documents that has a topic description that is sufficiently similar to the topic description of said document of interest.
-
-
31. The method of claim 30, further comprising the steps of:
-
a) receiving a plurality of documents;
b) determining a topic description for each of said plurality of documents;
c) comparing the topic descriptions of each of said plurality of documents to each other of said plurality of documents; and
d) sorting said plurality of documents by topic description.
-
Specification