Keyword extracting system and text retrieval system using the same
Abstract
A system for providing keywords to facilitate a search in a text retrieval system. For each text constituting a text base, the system creates a word ID for each word used in the text and a word occurrence count for that word, indicating the number of occurrences of the word in the text. For each word used in any of the texts constituting the text base, the system creates a total word occurrence count and a containing text count indicating the number of texts containing the word. For each word contained in texts selected from a retrieval result (the selected texts), a degree of importance is calculated by using the word occurrence count, the total word occurrence count and the containing text count. The words contained in the selected texts are sorted in order of degree of importance, and at least a part of the sorted words are displayed as related keywords.
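The per-text and text-base-wide counts described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the texts are already tokenized into word lists, and the names `build_statistics`, `local_stats`, `total_occurrences` and `containing_texts` are hypothetical.

```python
from collections import Counter


def build_statistics(text_base):
    """Build local and global word statistics.

    text_base: dict mapping a text ID to a list of word tokens
    (tokenization is assumed to have happened elsewhere).
    """
    local_stats = {}               # text ID -> word occurrence counts in that text
    total_occurrences = Counter()  # word -> total word occurrence count
    containing_texts = Counter()   # word -> containing text count
    for text_id, words in text_base.items():
        counts = Counter(words)
        local_stats[text_id] = counts
        total_occurrences.update(counts)         # adds the per-text counts
        containing_texts.update(counts.keys())   # adds 1 per distinct word
    return local_stats, total_occurrences, containing_texts


# Hypothetical two-text text base.
text_base = {
    "T1": ["keyword", "search", "keyword"],
    "T2": ["search", "index"],
}
local_stats, total_occurrences, containing_texts = build_statistics(text_base)
```

Here the words themselves stand in for word IDs; a real system would likely map each word to a compact identifier first.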
158 Citations
75 Claims
-
1. A method of assisting a user to search a text base in a text retrieval system having a function of receiving a query request and returning a list of text IDs of retrieved texts, the method comprising the steps of:
for each of texts constituting said text base, managing local statistical information on words, compound words and phrases (hereinafter referred to en bloc as “words”) used in each said text;
managing global statistical information on words used in any of said texts constituting said text base;
said user selecting at least one text from said text base to provide a selected text list of text IDs of selected texts by user implementation of the steps of:
issuing a query request by using user determined retrieval conditions to obtain a list of retrieved texts, and selecting at least one text from said retrieved texts;
for each of words contained in said selected texts, calculating a degree of importance by using said local statistical information for said retrieved texts and said global statistical information;
sorting said words contained in said selected texts in order of said degrees of importance;
displaying a predetermined number of said sorted words as related keywords; and
assisting said user to enter a query request by using said related keywords.
wherein said step of managing global statistical information includes the step of including, in said global statistical information, a word ID of each of said words used in any of said texts constituting said text base, a total word occurrence count and a containing text count which are associated with said word ID of each said word used in any said text, said total word occurrence count indicating a total number of occurrences in all of said texts constituting said text base and said containing text count indicating a number of texts containing each said word used in any said text, and wherein the method further comprises the step of defining said degree of importance such that said degree of importance is proportional to a sum of said word occurrence counts taken for said retrieved texts, a number of said retrieved texts, and a quantity defined for each said word contained in said retrieved texts such that if each said word appears in more of said texts constituting said text base, said quantity becomes the smaller.
3. A method as defined in claim 2, wherein said step of defining said degree of importance comprises the step of expressing said degree of importance I(Wj) as:
-
where Wj is a word ID of each said word contained in said retrieved texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said retrieved texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r=1, 2, . . . , R (R=a number of retrieved texts).
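The expression for I(Wj) is not reproduced in this text. The preceding claim describes the degree of importance as proportional to three factors, which suggests a product of the form I(Wj) = C × (sum over r of WOr(Wj)) × RCT(Wj) × IDF(Wj). The sketch below assumes that product form; it also assumes IDF(Wj) = 1 + ln(N / containing text count), which is only one of many quantities that shrink as a word appears in more texts. The function names and sample data are hypothetical.

```python
import math


def importance(word, retrieved_ids, local_stats, containing_texts, n_texts, c=1.0):
    """Assumed form: C * sum_r WOr(Wj) * RCT(Wj) * IDF(Wj)."""
    # Sum of the word occurrence counts over the retrieved texts.
    occurrence_sum = sum(local_stats[t].get(word, 0) for t in retrieved_ids)
    # RCT(Wj): number of retrieved texts that contain the word.
    rct = sum(1 for t in retrieved_ids if local_stats[t].get(word, 0) > 0)
    # Assumed IDF: smaller when the word appears in more texts of the text base.
    idf = 1.0 + math.log(n_texts / containing_texts[word])
    return c * occurrence_sum * rct * idf


def related_keywords(retrieved_ids, local_stats, containing_texts, n_texts, top_n=5):
    """Sort candidate words by descending importance and keep the top_n."""
    candidates = {w for t in retrieved_ids for w in local_stats[t]}
    scored = {
        w: importance(w, retrieved_ids, local_stats, containing_texts, n_texts)
        for w in candidates
    }
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scored, key=lambda w: (-scored[w], w))[:top_n]


# Hypothetical statistics for a three-text text base.
local_stats = {
    "T1": {"keyword": 2, "search": 1},
    "T2": {"search": 1, "index": 1},
    "T3": {"search": 2},
}
containing_texts = {"keyword": 1, "search": 3, "index": 1}
retrieved = ["T1", "T2"]
ranking = related_keywords(retrieved, local_stats, containing_texts, n_texts=3)
```

With this data "search" occurs in every text, so its assumed IDF is 1 and the rarer word "keyword" ranks ahead of it.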
-
-
4. A method as defined in claim 1, further comprising the steps of:
-
said user issuing a further query request to obtain such a smaller list as is a subset of said list;
calculating a distribution index for each said word contained in said selected texts by using statistical information on words used in said selected texts and statistical information on words contained in texts listed in said smaller list, said distribution index being so defined that if each said word contained in said selected texts is distributed in more of texts listed in said smaller list and distributed in less of said selected texts, said index becomes larger; and
weighting said degree of importance with said distribution index.
-
-
5. A method as defined in claim 4, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
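The distribution index of claim 5 can be computed directly from the two result lists. In this sketch the function name `distribution_index`, the zero-division guard and the sample data are assumptions added for illustration; only the expression (MA/CTA(Wj))*(CTB(Wj)/MB) comes from the claim.

```python
def distribution_index(word, list_a, list_b, local_stats):
    """Return (MA / CTA(Wj)) * (CTB(Wj) / MB).

    list_a: text IDs in the original list; list_b: text IDs in the smaller list.
    local_stats: text ID -> mapping of words to occurrence counts.
    """
    ma, mb = len(list_a), len(list_b)
    cta = sum(1 for t in list_a if word in local_stats[t])
    ctb = sum(1 for t in list_b if word in local_stats[t])
    if cta == 0 or mb == 0:
        return 0.0  # guard added for the sketch; the claim does not cover this case
    return (ma / cta) * (ctb / mb)


# Hypothetical data: four texts in the list, two of them in the smaller list.
local_stats = {
    "T1": {"laser": 1, "printer": 2},
    "T2": {"laser": 3, "toner": 1},
    "T3": {"printer": 1},
    "T4": {"printer": 1},
}
list_a = ["T1", "T2", "T3", "T4"]
list_b = ["T1", "T2"]
laser_index = distribution_index("laser", list_a, list_b, local_stats)
printer_index = distribution_index("printer", list_a, list_b, local_stats)
```

"laser" is concentrated in the smaller list and gets an index above 1, while "printer" is spread over the whole list and gets an index below 1, matching the behavior claim 4 requires of the index.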
-
6. A method as defined in claim 2, wherein said selected text list is sorted in order of degrees of congruity of said selected texts, wherein the method further comprises the step of receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said step of calculating said degree of importance includes the step of weighting said word occurrence count with said predetermined weight.
-
7. A method as defined in claim 2, further comprising the steps of:
assigning a weight to each of said selected texts, wherein said step of calculating said degree of importance includes the step of weighting said word occurrence count for each said selected text with said weight assigned to each said selected text.
-
8. A method as defined in claim 1, further comprising the steps of:
-
for each said word contained in said selected texts, making a test to see if a number of texts containing the word is within a predetermined range; and
if said word did not pass said test, excluding said word from candidates of said related keywords.
-
-
9. A method as defined in claim 8, further comprising the step of using, as said predetermined range, a value associated with a quantity characteristic of said word.
-
10. A method as defined in claim 9, wherein said quantity is a length of said word.
-
11. A method as defined in claim 8, further comprising the step of associating each of second predetermined ranges of a quantity characteristic of said word with a different predetermined range of said number of texts containing the word, wherein said step of making a test includes the step of using, as said predetermined range, one of said different predetermined ranges associated with a second predetermined range on which said quantity characteristic of said word falls.
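Claims 8 to 11 describe excluding candidate words whose containing text count falls outside a range that may depend on word length. A minimal sketch follows; the length buckets, the ranges and the function names are all hypothetical values chosen for illustration.

```python
# Hypothetical table: (min length, max length or None, (min count, max count)).
LENGTH_BUCKETS = [
    (1, 3, (3, 20)),      # short words must appear in 3 to 20 texts
    (4, 8, (2, 50)),
    (9, None, (1, 100)),  # long words are allowed to be rarer
]


def allowed_range(word):
    """Pick the containing-text-count range for the bucket the word length falls in."""
    for low_len, high_len, count_range in LENGTH_BUCKETS:
        if len(word) >= low_len and (high_len is None or len(word) <= high_len):
            return count_range
    raise ValueError("no bucket covers this word length")


def filter_candidates(words, containing_texts):
    """Keep only words whose containing text count is within the allowed range."""
    kept = []
    for word in words:
        low, high = allowed_range(word)
        if low <= containing_texts.get(word, 0) <= high:
            kept.append(word)
    return kept


containing_texts = {"ink": 1, "toner": 10, "the": 500, "photoconductor": 1}
kept = filter_candidates(["ink", "toner", "the", "photoconductor"], containing_texts)
```

In this example "the" is dropped as too common and "ink" as too rare for a short word, while the long word "photoconductor" survives with the same count of 1.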
-
12. A method as defined in claim 2, further comprising the steps of:
-
for each of texts constituting said text base, managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
assigning each of possible parts of each said text a predetermined weight factor; and
for each said text, accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said step of defining said degree of importance includes the step of weighting each of said word occurrence counts with said weight by text.
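The part-based weighting of claim 12 can be sketched as an accumulation over recorded occurrences. The part names and weight factors below are hypothetical, and the sketch stops at producing the weight by text; how that weight is then applied to the word occurrence counts is not specified in the claim.

```python
# Hypothetical weight factor for each part of a text.
PART_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def weight_by_text(occurrences):
    """Accumulate part weight factors per word for one text.

    occurrences: list of (word, part) pairs, one pair per occurrence.
    Returns a dict mapping each word to its accumulated weight in this text.
    """
    weights = {}
    for word, part in occurrences:
        weights[word] = weights.get(word, 0.0) + PART_WEIGHTS[part]
    return weights


occurrences = [("laser", "title"), ("laser", "body"), ("toner", "body")]
weights = weight_by_text(occurrences)
```

An occurrence in the title counts three times as much as one in the body under these assumed factors, so "laser" accumulates 4.0 against 1.0 for "toner".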
-
-
13. A method as defined in claim 1, further comprising the steps of:
-
for each of texts constituting said text base, managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
assigning each of predetermined distance ranges a predetermined weight factor; and
for each of texts constituting said text base, accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by text to each said word, wherein said step of defining said degree of importance includes the step of weighting each of said word occurrence counts with said weight by text.
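The distance-based weighting of claim 13 can be sketched in the same way. The sketch assumes locations are word positions within a text and distances are absolute position differences; the distance ranges, weight factors and names are hypothetical.

```python
# Hypothetical (min distance, max distance, weight factor) ranges.
DISTANCE_WEIGHTS = [(0, 5, 3.0), (6, 20, 2.0)]
DEFAULT_WEIGHT = 1.0  # assumed factor for distances beyond the listed ranges


def distance_weight(distance):
    """Look up the weight factor for the range the distance falls in."""
    for low, high, weight in DISTANCE_WEIGHTS:
        if low <= distance <= high:
            return weight
    return DEFAULT_WEIGHT


def proximity_weight(word_positions, keyword_positions):
    """Accumulate weight factors over every (occurrence, query keyword) pair in one text."""
    total = 0.0
    for position in word_positions:
        for keyword_position in keyword_positions:
            total += distance_weight(abs(position - keyword_position))
    return total


# A word at positions 10 and 40; query keywords at positions 12 and 100.
weight = proximity_weight([10, 40], [12, 100])
```

Only the occurrence at position 10 lies close to a query keyword, so it contributes the highest factor and the remaining three pairs contribute the default.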
-
-
14. A method as defined in claim 1, further comprising the step of weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
-
15. A method as defined in claim 1, further comprising the step of:
if any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
-
16. A method as defined in claim 15, further comprising the step of setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
-
17. A method as defined in claim 15, further comprising the step of setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
-
18. A method as defined in claim 15, wherein said step of selecting one of two words includes the step of selecting a shorter word and/or a difference between said two words.
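Claims 15 to 17 describe resolving inclusion relations between candidate keywords by a predetermined criterion such as length or degree of importance. The sketch below treats an inclusion relation as a substring match and covers only pairs among the sorted words, not pairs involving query keywords; the function name, the `prefer` parameter and the sample scores are hypothetical.

```python
def resolve_inclusions(ranked, scores, prefer="importance"):
    """Keep one word from every pair in which one word contains the other.

    ranked: words sorted by degree of importance; scores: word -> importance.
    prefer: "importance" keeps the higher-scored word, "length" keeps the longer one.
    """
    if prefer == "length":
        priority = sorted(ranked, key=len, reverse=True)
    else:
        priority = sorted(ranked, key=lambda w: scores[w], reverse=True)
    kept = []
    for word in priority:
        # Skip a word that contains, or is contained in, an already kept word.
        if not any(word in k or k in word for k in kept):
            kept.append(word)
    # Return the survivors in their original ranking order.
    return [w for w in ranked if w in kept]


ranked = ["printer", "laser printer", "toner"]
scores = {"printer": 6.0, "laser printer": 5.0, "toner": 3.0}
by_importance = resolve_inclusions(ranked, scores, prefer="importance")
by_length = resolve_inclusions(ranked, scores, prefer="length")
```

With the importance criterion "printer" wins over "laser printer"; with the length criterion the longer compound is kept instead.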
-
19. A method as defined in claim 3, further comprising the steps of:
-
on a basis of keywords used in said query request and said list from said function, sorting said list in order of degrees of congruity of said selected texts; and
assigning each of said selected texts of said sorted list a predetermined weight, wherein said step of expressing said degree of importance includes the step of weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
-
-
20. A method as defined in claim 1, further comprising the step of classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
-
21. A method as defined in claim 1, further comprising the step of classifying said sorted words by statistical data of said sorted words into groups of similar keywords for display.
-
22. A method as defined in claim 1, further comprising the step of classifying said sorted words by a thesaurus into groups of similar keywords for display.
-
23. A method as defined in claim 20, further comprising the step of displaying representative keywords in place of said groups.
-
24. A method as defined in claim 21, further comprising the step of displaying representative keywords in place of said groups.
-
25. A method as defined in claim 22, further comprising the step of displaying representative keywords in place of said groups.
-
26. A method as defined in claim 1, wherein said assisting said user includes the step of, in response to a predetermined input from said user, automatically generating said query request by using at least a part of said predetermined number of said related words.
-
27. A method as defined in claim 1, further comprising the steps of:
storing said predetermined number of said related words; and
in response to a predetermined input from said user, displaying said stored predetermined number of said related words.
-
28. A system for assisting a user to search a text base in a text retrieval system having a function of receiving a query request and returning a list of text IDs of retrieved texts, the system comprising:
means, operative for each of texts constituting said text base, for managing local statistical information on words used in each said text;
means for managing global statistical information on words used in any of said texts constituting said text base;
means for permitting said user to select at least one text from said text base to provide a selected text list of text IDs of selected texts by permitting said user to issue a query request by using user determined retrieval conditions to obtain a list of retrieved texts and by permitting said user to select at least one text from said retrieved texts;
means, operative for each of words contained in said selected texts listed in said selected text list, for calculating a degree of importance by using said local statistical information for said retrieved texts and said global statistical information;
means for sorting said words contained in said selected texts in order of said degrees of importance;
means for displaying a predetermined number of said sorted words with highest degrees of importance as related keywords; and
means for assisting said user to enter a query request by using said related keywords.
wherein said means for managing global statistical information includes means for including, in said global statistical information, a word ID of each of said words used in any of said texts constituting said text base, a total word occurrence count and a containing text count which are associated with said word ID of each said word used in any said text, said total word occurrence count indicating a total number of occurrences in all of said texts constituting said text base and said containing text count indicating a number of texts containing each said word used in any said text, and wherein the system further comprises means for defining said degree of importance such that said degree of importance is proportional to a sum of said word occurrence counts taken for said selected texts, a number of said selected texts, and a quantity defined for each said word contained in said selected texts such that if each said word appears in more of said texts constituting said text base, said quantity becomes the smaller.
30. A system as defined in claim 29, wherein said means for defining said degree of importance comprises means for expressing said degree of importance I(Wj) as:
-
where Wj is a word ID of each said word contained in said selected texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said selected texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r=1, 2, . . . , R (R=a number of selected texts).
-
-
31. A system as defined in claim 28, further comprising:
-
means, responsive to a determination that a further query request from said user has caused said function to return such a smaller list as is a subset of said list, for calculating a distribution index for each said word contained in said selected texts by using statistical information on words used in said selected texts and statistical information on words contained in texts listed in said smaller list, said distribution index being so defined that if each said word contained in said selected texts is distributed in more of texts listed in said smaller list and distributed in less of said selected texts, said index becomes larger; and
means for weighting said degree of importance with said distribution index.
-
-
32. A system as defined in claim 31, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
-
33. A system as defined in claim 29, wherein said list is sorted in order of degrees of congruity of said selected texts, wherein the system further comprises means for receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said means for calculating said degree of importance includes means for weighting said word occurrence count with said predetermined weight.
-
34. A system as defined in claim 29, further comprising means for permitting said user to assign a weight to each of said selected texts, wherein said means for calculating said degree of importance includes means for weighting said word occurrence count for each said selected text with said weight assigned to each said selected text.
-
35. A system as defined in claim 28, further comprising:
-
means, operative for each said word contained in said selected texts, for making a test to see if a number of texts containing the word is within a predetermined range; and
means, responsive to a determination that said word did not pass said test, for excluding said word from candidates of said related keywords.
-
-
36. A system as defined in claim 35, further comprising means for using, as said predetermined range, a value associated with a quantity characteristic of said word.
-
37. A system as defined in claim 36, wherein said quantity is a length of said word.
-
38. A system as defined in claim 35, further comprising means for associating each of second predetermined ranges of a quantity characteristic of said word with a different predetermined range of said number of texts containing the word, wherein said means for making a test includes means for using, as said predetermined range, one of said different predetermined ranges associated with a second predetermined range on which said quantity characteristic of said word falls.
-
39. A system as defined in claim 29, further comprising:
-
means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
means for assigning each of possible parts of each said text a predetermined weight factor; and
means operative for each said text for accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
-
-
40. A system as defined in claim 28, further comprising:
-
means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
means for calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
means for assigning each of predetermined distance ranges a predetermined weight factor; and
means, operative for each of texts constituting said text base, for accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
-
-
41. A system as defined in claim 28, further comprising means for weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
-
42. A system as defined in claim 28, further comprising means, responsive to a determination that any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, for selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
-
43. A system as defined in claim 42, further comprising means for setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
-
44. A system as defined in claim 42, further comprising means for setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
-
45. A system as defined in claim 42, wherein said means for selecting one of two words includes means for selecting a shorter word and/or a difference between said two words.
-
46. A system as defined in claim 30, further comprising:
-
means, operative on a basis of keywords used in said query request and said list from said function, for sorting said list in order of degrees of congruity of said selected texts; and
means for assigning each of said selected texts of said sorted list a predetermined weight, wherein said means for expressing said degree of importance includes means for weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
-
-
47. A system as defined in claim 28, further comprising means for classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
-
48. A system as defined in claim 28, further comprising means for classifying said sorted words by statistical data of said sorted words into groups of similar keywords for display.
-
49. A system as defined in claim 28, further comprising means for classifying said sorted words by a thesaurus into groups of similar keywords for display.
-
50. A system as defined in claim 47, further comprising means for displaying representative keywords in place of said groups.
-
51. A system as defined in claim 48, further comprising means for displaying representative keywords in place of said groups.
-
52. A system as defined in claim 49, further comprising means for displaying representative keywords in place of said groups.
-
53. A system as defined in claim 28, wherein said means for assisting said user includes means, responsive to a predetermined input from said user, for automatically generating said query request by using at least a part of said predetermined number of said related words.
-
54. A system as defined in claim 28, further comprising:
means for storing said predetermined number of said related words; and
means responsive to a predetermined input from said user for displaying said stored predetermined number of said related words.
-
55. A text retrieval system capable of assisting a user to search a text base by providing keywords on the basis of at least one preceding search, the text retrieval system comprising:
-
a multiplicity of texts constituting said text base;
means for managing attribute information on said texts constituting said text base;
means, operative for each of texts constituting said text base, for managing local statistical information on words used in each said text;
means for managing global statistical information on words used in any of said texts constituting said text base;
means for permitting said user to issue a query request;
means responsive to said query request for providing a list of text IDs of selected texts;
means, operative for each of words contained in said selected texts listed in said selected text list, for calculating a degree of importance by using said local statistical information for said retrieved texts and said global statistical information;
means for sorting said words contained in said selected texts in order of said degrees of importance;
means for displaying a predetermined number of said sorted words with highest degrees of importance as related keywords; and
means for assisting said user to enter a query request by using said related keywords.
wherein said means for managing global statistical information includes a global statistical table for storing a word ID of each of said words used in any of said texts constituting said text base, a total word occurrence count and a containing text count which are associated with said word ID of each said word used in any said text, said total word occurrence count indicating a total number of occurrences in all of said texts constituting said text base and said containing text count indicating a number of texts containing each said word used in any said text, and wherein said degree of importance is proportional to a sum of said word occurrence counts taken for said selected texts, a number of said selected texts, and a quantity defined for each said word contained in said selected texts such that if each said word appears in more of said texts constituting said text base, said quantity becomes the smaller.
57. A system as defined in claim 56, wherein said degree of importance, I(Wj), is defined as:
-
where Wj is a word ID of each said word contained in said selected texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said selected texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r=1, 2, . . . , R (R=a number of selected texts).
-
-
58. A system as defined in claim 55, further comprising:
-
means, responsive to a determination that a first query request and a second query request issued after said first one have resulted in a first list of first text IDs of first selected texts and a second list of second text IDs of second selected texts such that said second list is a subset of said first list, for calculating a distribution index for each said word contained in said first selected texts by using statistical information on words used in said first selected texts and statistical information on words used in said second selected texts, said distribution index being so defined that if each word is distributed in more of texts listed in said second list and distributed in less of said first selected texts, said index of the word becomes larger, and means for weighting said degree of importance with said distribution index.
-
-
59. A system as defined in claim 58, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
-
60. A system as defined in claim 56, wherein said list is sorted in order of degrees of congruity of said selected texts, wherein the system further comprises means for receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said means for calculating said degree of importance includes means for weighting said word occurrence count with said predetermined weight.
-
61. A system as defined in claim 55, further comprising:
-
means, operative for each said word contained in said selected text, for making a test to see if a number of texts containing the word is within a predetermined range; and
means, responsive to a determination that said word did not pass said test, for excluding said word from candidates of said related keywords.
-
-
62. A system as defined in claim 56, further comprising:
-
means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
means for assigning each of possible parts of each said text a predetermined weight factor; and
means operative for each said text for accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
-
-
63. A system as defined in claim 55, further comprising:
-
means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
means for calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
means for assigning each of predetermined distance ranges a predetermined weight factor; and
means, operative for each of texts constituting said text base, for accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
-
-
64. A system as defined in claim 55, further comprising means for weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
-
65. A system as defined in claim 55, further comprising means, responsive to a determination that any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, for selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
-
66. A system as defined in claim 64, further comprising means for setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
-
67. A system as defined in claim 64, further comprising means for setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
-
68. A system as defined in claim 64, wherein said means for selecting one of two words includes means for selecting a shorter word and/or a difference between said two words.
-
69. A system as defined in claim 57, further comprising:
-
means, operative on a basis of keywords used in said query request and said list from said function, for sorting said list in order of degrees of congruity of said selected texts; and
means for assigning each of said selected texts of said sorted list a predetermined weight, wherein said means for expressing said degree of importance includes means for weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
-
-
70. A system as defined in claim 55, further comprising means for classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
-
71. A system as defined in claim 70, further comprising means for displaying representative keywords in place of said groups.
-
72. A system as defined in claim 55, wherein said means for assisting said user includes means, responsive to a predetermined input from said user, for automatically generating said query request by using at least a part of said predetermined number of said related words.
-
73. A system as defined in claim 55, further comprising:
means for storing said predetermined number of said related words; and
means responsive to a predetermined input from said user for displaying said stored predetermined number of said related words.
-
74. A system as defined in claim 55, further comprising a storage media drive adapted for a detachable mass storage medium, wherein said multiplicity of texts constituting said text base are stored in one of said detachable mass storage media.
-
75. A system as defined in claim 55, further comprising a two way communication means, wherein the system is distributed on a server and client system.
-
Specification