Keyword extracting system and text retrieval system using the same

US 6,212,517 B1
Filed: 06/30/1998
Issued: 04/03/2001
Est. Priority Date: 07/02/1997
Status: Expired due to Term

Abstract

A system that provides keywords to facilitate searching in a text retrieval system. For each text in a text base, the system creates a word ID for each word used in the text and a word occurrence count for that word, indicating the number of times the word occurs in the text. For each word used in any text of the text base, the system creates a total word occurrence count and a containing text count, the latter indicating the number of texts that contain the word. For each word contained in the texts selected by the user, a degree of importance is calculated from the word occurrence count, the total word occurrence count and the containing text count. The words contained in the selected texts are sorted in order of degree of importance, and at least a part of the sorted words is displayed as related keywords.
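
The three statistics named in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the function name `build_statistics`, the use of the word string itself as the word ID, and the pre-tokenized toy text base are all assumptions made for this example.

```python
from collections import Counter

def build_statistics(text_base):
    """text_base maps a text ID to its list of word tokens (hypothetical layout)."""
    local_stats = {}               # text ID -> Counter of word occurrence counts
    total_occurrences = Counter()  # word -> total word occurrence count
    containing_texts = Counter()   # word -> containing text count
    for text_id, words in text_base.items():
        counts = Counter(words)
        local_stats[text_id] = counts
        total_occurrences.update(counts)         # adds the per-text counts
        containing_texts.update(counts.keys())   # adds 1 per text containing the word
    return local_stats, total_occurrences, containing_texts

# Toy text base; the word string doubles as the word ID (assumption).
text_base = {
    "T1": ["keyword", "search", "keyword"],
    "T2": ["search", "index"],
}
local_stats, total_occurrences, containing_texts = build_statistics(text_base)
```

Here `local_stats` plays the role of the per-text (local) information, while `total_occurrences` and `containing_texts` together play the role of the global information.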

158 Citations

75 Claims

1. A method of assisting a user to search a text base in a text retrieval system having a function of receiving a query request and returning a list of text IDs of retrieved texts;
- the method comprising the steps of:
  
  for each of texts constituting said text base, managing local statistical information on words, compound words and phrases (hereinafter, referred to en bloc as “words”) used in each said text;
  
  managing global statistical information on words used in any of said texts constituting said text base;
  
  said user selecting at least one text from said text base to provide a selected text list of text IDs of selected texts by user implementation of the steps of:
  
  issuing a query request by using user determined retrieval conditions to obtain a list of retrieved texts, and selecting at least one text from said retrieved texts;
  
  for each of words contained in said selected texts, calculating a degree of importance by using said local statistical information for said retrieval texts and said global statistical information;
  
  sorting said words contained in said selected texts in order of said degrees of importance;
  
  displaying a predetermined number of said sorted words as related keywords; and
  
  assisting said user to enter a query request by using said related keywords.
- 2. A method as defined in claim 1, wherein said step of managing local statistical information includes the step of including, in said local statistical information, a word ID of each of words used in each said text and a word occurrence count associated with said word ID, said word occurrence count indicating a number of occurrences, in each said text, of each said word used in each said text,
- 3. A method as defined in claim 2, wherein said step of defining said degree of importance comprises the step of expressing said degree of importance I(Wj) as:
  - $I(Wj) = C \cdot \sum_{r=1}^{R} \{WOr(Wj) \cdot IDF(Wj)\} \cdot RCT(Wj)$, where Wj is a word ID of each said word contained in said retrieved texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said retrieved texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r = 1, 2, ..., R (R = a number of retrieved texts).
- 4. A method as defined in claim 1, further comprising the steps of:
  - said user issuing a further query request to obtain such a smaller list as is a subset of said list;
    
    calculating a distribution index for each said word contained in said selected texts by using statistical information on words used in said selected texts and statistical information on words contained in texts listed in said smaller list, said distribution index being so defined that if each said word contained in said selected texts is distributed in more of texts listed in said smaller list and distributed in less of said selected texts, said index becomes larger; and
    
    weighting said degree of importance with said distribution index.
- 5. A method as defined in claim 4, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
- 6. A method as defined in claim 2, wherein said selected text list is sorted in order of degrees of congruity of said selected texts, wherein the method further comprises the step of receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said step of calculating said degree of importance includes the step of weighting said word occurrence count with said predetermined weight.
- 7. A method as defined in claim 2, further comprising the steps of:
  - assigning a weight to each of said selected texts, wherein said step of calculating said degree of importance includes the step of weighting said word occurrence count for each said selected text with said weight assigned to each said selected text.
- 8. A method as defined in claim 1, further comprising the steps of:
  - for each said word contained in said selected texts, making a test to see if a number of texts containing the word is within a predetermined range; and
    
    if said word did not pass said test, excluding said word from candidates of said related keywords.
- 9. A method as defined in claim 8, further comprising the step of using, as said predetermined range, a value associated with a quantity characteristic of said word.
- 10. A method as defined in claim 9, wherein said quantity is a length of said word.
- 11. A method as defined in claim 8, further comprising the step of associating each of second predetermined ranges of a quantity characteristic of said word with a different predetermined range of said number of texts containing the word, wherein said step of making a test includes the step of using, as said predetermined range, one of said different predetermined ranges associated with a second predetermined range on which said quantity characteristic of said word falls.
- 12. A method as defined in claim 2, further comprising the steps of:
  - for each of texts constituting said text base, managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
    
    assigning each of possible parts of each said text a predetermined weight factor; and
    
    for each said text, accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said step of defining said degree of importance includes the step of weighting each of said word occurrence counts with said weight by text.
- 13. A method as defined in claim 1, further comprising the steps of:
  - for each of texts constituting said text base, managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
    
    calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
    
    assigning each of predetermined distance ranges a predetermined weight factor; and
    
    for each of texts constituting said text base, accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by texts to each said word, wherein said step of defining said degree of importance includes the step of weighting each of said word occurrence counts with said weight by text.
- 14. A method as defined in claim 1, further comprising the step of weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
- 15. A method as defined in claim 1, further comprising the step of:
  - if any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
- 16. A method as defined in claim 15, further comprising the step of setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
- 17. A method as defined in claim 15, further comprising the step of setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
- 18. A method as defined in claim 15, wherein said step of selecting one of two words includes the step of selecting a shorter word and/or a difference between said two words.
- 19. A method as defined in claim 3, further comprising the steps of:
  - on a basis of keywords used in said query request and said list from said function, sorting said list in order of degrees of congruity of said selected texts; and
    
    assigning each of said selected texts of said sorted list a predetermined weight, wherein said step of expressing said degree of importance includes the step of weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
- 20. A method as defined in claim 1, further comprising the step of classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
- 21. A method as defined in claim 1, further comprising the step of classifying said sorted words by statistical data of said sorted words into groups of similar keywords for display.
- 22. A method as defined in claim 1, further comprising the step of classifying said sorted words by a thesaurus into groups of similar keywords for display.
- 23. A method as defined in claim 20, further comprising the step of displaying representative keywords in place of said groups.
- 24. A method as defined in claim 21, further comprising the step of displaying representative keywords in place of said groups.
- 25. A method as defined in claim 22, further comprising the step of displaying representative keywords in place of said groups.
- 26. A method as defined in claim 1, wherein said assisting said user includes the step of, in response to a predetermined input from said user, automatically generating said query request by using at least a part of said predetermined number of said related words.
- 27. A method as defined in claim 1, further comprising the steps of:
  - storing said predetermined number of said related words; and
    
    in response to a predetermined input from said user, displaying said stored predetermined number of said related words.
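
The scoring, sorting and top-N display steps of claims 1 to 3 can be sketched as follows. This is a hedged illustration, not the patentee's code: the function names, the toy statistics and the tie-break by alphabetical order are hypothetical, and the form of IDF(Wj) is an assumption (a logarithm of the total text count over the containing text count), because the claims in this section refer to it only as "said quantity".

```python
import math

def importance_scores(selected_ids, local_stats, containing_texts, num_texts, c=1.0):
    """Return {word: I(Wj)} for every word found in the selected texts."""
    words = set()
    for text_id in selected_ids:
        words.update(local_stats[text_id])

    scores = {}
    for word in words:
        # ASSUMPTION: IDF(Wj) = log(N / containing text count). The claims only
        # call it "said quantity" and do not define it here.
        idf = math.log(num_texts / containing_texts[word])
        # Sum over the R selected texts of WOr(Wj) * IDF(Wj).
        weighted_sum = sum(local_stats[t].get(word, 0) * idf for t in selected_ids)
        # RCT(Wj): number of selected texts that contain the word.
        rct = sum(1 for t in selected_ids if word in local_stats[t])
        scores[word] = c * weighted_sum * rct
    return scores

def related_keywords(scores, top_n):
    """Sort by descending importance (ties alphabetically) and keep top_n words."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

# Hypothetical statistics for a three-text base; T1 and T2 are the selected texts.
local_stats = {
    "T1": {"keyword": 2, "search": 1},
    "T2": {"search": 1, "index": 1},
    "T3": {"index": 3, "query": 1},
}
containing_texts = {"keyword": 1, "search": 2, "index": 2, "query": 1}
scores = importance_scores(["T1", "T2"], local_stats, containing_texts, num_texts=3)
top = related_keywords(scores, top_n=2)
```

With these toy numbers, "keyword" ranks above "search" because it is rarer in the text base, even though "search" appears in both selected texts.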

28. A system for assisting a user to search a text base in a text retrieval system having a function of receiving a query request and returning a list of text IDs of retrieved texts;
- the system comprising:
  
  means, operative for each of texts constituting said text base, for managing local statistical information on words used in each said text;
  
  means for managing global statistical information on words used in any of said texts constituting said text base;
  
  means for permitting said user to select at least one text from said text base to provide a selected text list of text IDs of selected texts by permitting said user to issue a query request by using user determined retrieval conditions to obtain a list of retrieved texts and by permitting said user to select at least one text from said retrieved texts;
  
  means, operative for each of words contained in said selected texts listed in said selected text list, for calculating a degree of importance by using said local statistical information for said retrieval texts and said global statistical information;
  
  means for sorting said words contained in said selected texts in order of said degrees of importance;
  
  means for displaying a predetermined number of said sorted words with highest degrees of importance as related keywords; and
  
  means for assisting said user to enter a query request by using said related keywords.
- 29. A system as defined in claim 28, wherein said means for managing local statistical information includes means for including, in said local statistical information, a word ID of each of words used in each said text and a word occurrence count associated with said word ID, said word occurrence count indicating a number of occurrences, in each said text, of each said word used in each said text,
- 30. A system as defined in claim 29, wherein said means for defining said degree of importance comprises means for expressing said degree of importance I(Wj) as:
  - $I(Wj) = C \cdot \sum_{r=1}^{R} \{WOr(Wj) \cdot IDF(Wj)\} \cdot RCT(Wj)$, where Wj is a word ID of each said word contained in said selected texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said selected texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r = 1, 2, ..., R (R = a number of selected texts).
- 31. A system as defined in claim 28, further comprising:
  - means, responsive to a determination that a further query request from said user has caused said function to return such a smaller list as is a subset of said list, for calculating a distribution index for each said word contained in said selected texts by using statistical information on words used in said selected texts and statistical information on words contained in texts listed in said smaller list, said distribution index being so defined that if each said word contained in said selected texts is distributed in more of texts listed in said smaller list and distributed in less of said selected texts, said index becomes larger; and
    
    means for weighting said degree of importance with said distribution index.
- 32. A system as defined in claim 31, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
- 33. A system as defined in claim 29, wherein said list is sorted in order of degrees of congruity of said selected texts, wherein the system further comprises means for receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said means for calculating said degree of importance includes means for weighting said word occurrence count with said predetermined weight.
- 34. A system as defined in claim 29, further comprising means for permitting said user to assign a weight to each of said selected texts, wherein said means for calculating said degree of importance includes means for weighting said word occurrence count for each said selected text with said weight assigned to each said selected text.
- 35. A system as defined in claim 28, further comprising:
  - means, operative for each said word contained in said selected texts, for making a test to see if a number of texts containing the word is within a predetermined range; and
    
    means, responsive to a determination that said word did not pass said test, for excluding said word from candidates of said related keywords.
- 36. A system as defined in claim 35, further comprising means for using, as said predetermined range, a value associated with a quantity characteristic of said word.
- 37. A system as defined in claim 36, wherein said quantity is a length of said word.
- 38. A system as defined in claim 35, further comprising means for associating each of second predetermined ranges of a quantity characteristic of said word with a different predetermined range of said number of texts containing the word, wherein said means for making a test includes means for using, as said predetermined range, one of said different predetermined ranges associated with a second predetermined range on which said quantity characteristic of said word falls.
- 39. A system as defined in claim 29, further comprising:
  - means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
    
    means for assigning each of possible parts of each said text a predetermined weight factor; and
    
    means operative for each said text for accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
- 40. A system as defined in claim 28, further comprising:
  - means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
    
    means for calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
    
    means for assigning each of predetermined distance ranges a predetermined weight factor; and
    
    means, operative for each of texts constituting said text base, for accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by texts to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
- 41. A system as defined in claim 28, further comprising means for weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
- 42. A system as defined in claim 28, further comprising means, responsive to a determination that any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, for selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
- 43. A system as defined in claim 42, further comprising means for setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
- 44. A system as defined in claim 42, further comprising means for setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
- 45. A system as defined in claim 42, wherein said means for selecting one of two words includes means for selecting a shorter word and/or a difference between said two words.
- 46. A system as defined in claim 30, further comprising:
  - means, operative on a basis of keywords used in said query request and said list from said function, for sorting said list in order of degrees of congruity of said selected texts; and
    
    means for assigning each of said selected texts of said sorted list a predetermined weight, wherein said means for expressing said degree of importance includes means for weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
- 47. A system as defined in claim 28, further comprising means for classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
- 48. A system as defined in claim 28, further comprising means for classifying said sorted words by statistical data of said sorted words into groups of similar keywords for display.
- 49. A system as defined in claim 28, further comprising means for classifying said sorted words by a thesaurus into groups of similar keywords for display.
- 50. A system as defined in claim 47, further comprising means for displaying representative keywords in place of said groups.
- 51. A system as defined in claim 48, further comprising means for displaying representative keywords in place of said groups.
- 52. A system as defined in claim 49, further comprising means for displaying representative keywords in place of said groups.
- 53. A system as defined in claim 28, wherein said means for assisting said user includes means, responsive to a predetermined input from said user, for automatically generating said query request by using at least a part of said predetermined number of said related words.
- 54. A system as defined in claim 28, further comprising:
  - means for storing said predetermined number of said related words; and
    
    means responsive to a predetermined input from said user for displaying said stored predetermined number of said related words.
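
The distribution index of claims 31 and 32 (and likewise claims 4 and 5) can be worked through with a small Python sketch. The function name, the per-text word sets and the zero-division guard are assumptions for illustration; only the expression (MA/CTA(Wj))*(CTB(Wj)/MB) comes from the claims.

```python
def distribution_index(word, first_list, smaller_list, words_by_text):
    """Compute (MA / CTA(Wj)) * (CTB(Wj) / MB) for one word.

    first_list    -- text IDs returned by the earlier query (MA texts)
    smaller_list  -- text IDs returned by the narrower query (MB texts)
    words_by_text -- text ID -> set of words in that text (hypothetical layout)
    """
    ma, mb = len(first_list), len(smaller_list)
    cta = sum(1 for t in first_list if word in words_by_text[t])
    ctb = sum(1 for t in smaller_list if word in words_by_text[t])
    if cta == 0 or mb == 0:
        # ASSUMPTION: treat an undefined ratio as 0; the claims do not say.
        return 0.0
    return (ma / cta) * (ctb / mb)

# Toy data: the second query narrowed four texts down to two.
words_by_text = {
    "T1": {"patent", "claim"},
    "T2": {"patent", "claim"},
    "T3": {"claim"},
    "T4": {"claim"},
}
first_list = ["T1", "T2", "T3", "T4"]
smaller_list = ["T1", "T2"]

patent_index = distribution_index("patent", first_list, smaller_list, words_by_text)
claim_index = distribution_index("claim", first_list, smaller_list, words_by_text)
```

In this toy case "patent" scores 2.0 and "claim" scores 1.0: "patent" occurs in every text of the smaller list but in only half of the larger one, so weighting the degree of importance by the index favors it.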

55. A text retrieval system capable of assisting a user to search a text base by providing keywords on the basis of at least one preceding search, the text retrieval system comprising:
- a multiplicity of texts constituting said text base;
  
  means for managing attribute information on said texts constituting said text base;
  
  means, operative for each of texts constituting said text base, for managing local statistical information on words used in each said text;
  
  means for managing global statistical information on words used in any of said texts constituting said text base;
  
  means for permitting said user to issue a query request;
  
  means responsive to said query request for providing a list of text IDs of selected texts;
  
  means, operative for each of words contained in said selected texts listed in said selected text list, for calculating a degree of importance by using said local statistical information for said retrieval texts and said global statistical information;
  
  means for sorting said words contained in said selected texts in order of said degrees of importance;
  
  means for displaying a predetermined number of said sorted words with highest degrees of importance as related keywords; and
  
  means for assisting said user to enter a query request by using said related keywords.
- 56. A system as defined in claim 55, wherein said means for managing local statistical information includes a plurality of local statistical tables each associated with one of said texts constituting said text base, a local table associated with each said text containing a word ID of each of words used in each said text and a word occurrence count associated with said word ID, said word occurrence count indicating a number of occurrences, in each said text, of each said word used in each said text,
- 57. A system as defined in claim 56, wherein said degree of importance, I(Wj), is defined as:
  - $I(Wj) = C \cdot \sum_{r=1}^{R} \{WOr(Wj) \cdot IDF(Wj)\} \cdot RCT(Wj)$, where Wj is a word ID of each said word contained in said selected texts, C is a constant, WOr(Wj) is said word occurrence count of each said word Wj in each said retrieved text RTr, RCT(Wj) is a number of said selected texts which contain each said word Wj, and IDF(Wj) is said quantity, where RTr is a text ID of each said retrieved text and r = 1, 2, ..., R (R = a number of selected texts).
- 58. A system as defined in claim 55, further comprising:
  - means, responsive to a determination that a first query request and a second query request issued after said first one have resulted in a first list of first text IDs of first selected texts and a second list of second text IDs of second selected texts such that said second list is a subset of said first list, for calculating a distribution index for each said word contained in said first selected texts by using statistical information on words used in said first selected texts and statistical information on words used in said second selected texts, said distribution index being so defined that if each word is distributed in more of texts listed in said second list and distributed in less of said first selected texts, said index of the word becomes larger, and means for weighting said degree of importance with said distribution index.
- 59. A system as defined in claim 58, wherein said distribution index is expressed as {(MA/CTA(Wj))*(CTB(Wj)/MB)}, where MA and MB are numbers of texts listed in said list and said smaller list, respectively, and CTA(Wj) and CTB(Wj) are numbers of texts which are listed in said list and said smaller list, respectively, and which contain each said word Wj contained in said selected texts.
- 60. A system as defined in claim 56, wherein said list is sorted in order of degrees of congruity of said selected texts, wherein the system further comprises means for receiving said sorted list and assigning each of said selected texts of said sorted list a predetermined weight, and wherein said means for calculating said degree of importance includes means for weighting said word occurrence count with said predetermined weight.
- 61. A system as defined in claim 55, further comprising:
  - means, operative for each said word contained in said selected text, for making a test to see if a number of texts containing the word is within a predetermined range; and
    
    means, responsive to a determination that said word did not pass said test, for excluding said word from candidates of said related keywords.
- 62. A system as defined in claim 56, further comprising:
  - means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a part, of each said text, of said each occurrence;
    
    means for assigning each of possible parts of each said text a predetermined weight factor; and
    
    means operative for each said text for accumulating said predetermined weight factor associated with said part of said each occurrence of each said word to yield a weight by text to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
- 63. A system as defined in claim 55, further comprising:
  - means, operative for each of texts constituting said text base, for managing each occurrence of each said word in each said text constituting said text base and a location, in each said text, of said each occurrence;
    
    means for calculating, for said each occurrence of each said word in each said text, a distance between said location and a location of each of keywords used in said query request;
    
    means for assigning each of predetermined distance ranges a predetermined weight factor; and
    
    means, operative for each of texts constituting said text base, for accumulating said predetermined weight factor associated with said distance for each said keyword for said each occurrence of each said word to yield a weight by texts to each said word, wherein said means for defining said degree of importance includes means for weighting each of said word occurrence counts with said weight by text.
- 64. A system as defined in claim 55, further comprising means for weighting said degree of importance with a weight associated with an attribute of each said word in said selected texts.
- 65. A system as defined in claim 55, further comprising means, responsive to a determination that any inclusion relation is found either in any two of said sorted words or between any of said sorted words and any of keywords used in said query request, for selecting one of two words involved in said inclusion relation on a basis of a predetermined criterion.
- 66. A system as defined in claim 64, further comprising means for setting said predetermined criterion for a comparison of lengths between said two words involved in said inclusion relation.
- 67. A system as defined in claim 64, further comprising means for setting said predetermined criterion for a comparison of degrees of importance between said two words involved in said inclusion relation.
- 68. A system as defined in claim 64, wherein said means for selecting one of two words includes means for selecting a shorter word and/or a difference between said two words.
- 69. A system as defined in claim 57, further comprising:
  - means, operative on a basis of keywords used in said query request and said list from said function, for sorting said list in order of degrees of congruity of said selected texts; and
    
    means for assigning each of said selected texts of said sorted list a predetermined weight, wherein said means for expressing said degree of importance includes means for weighting said word occurrence count WOr(Wj) with one of said predetermined weights associated with each said retrieved text RTr.
- 70. A system as defined in claim 55, further comprising means for classifying said sorted words by attributes of said sorted words into groups of similar keywords for display.
- 71. A system as defined in claim 70, further comprising means for displaying representative keywords in place of said groups.
- 72. A system as defined in claim 55, wherein said means for assisting said user includes means, responsive to a predetermined input from said user, for automatically generating said query request by using at least a part of said predetermined number of said related words.
- 73. A system as defined in claim 55, further comprising:
  - means for storing said predetermined number of said related words; and
    
    means responsive to a predetermined input from said user for displaying said stored predetermined number of said related words.
- 74. A system as defined in claim 55, further comprising a storage media drive adapted for a detachable mass storage medium, wherein said multiplicity of texts constituting said text base are stored in one of said detachable mass storage media.
- 75. A system as defined in claim 55, further comprising a two way communication means, wherein the system is distributed on a server and client system.
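
Two of the candidate-pruning ideas in the claims above, the containing-text range test (claim 61) and the inclusion relation between keywords (claim 65), can be sketched in Python as below. Both functions are hypothetical illustrations: the range bounds are made-up numbers, substring containment stands in for "inclusion relation", and keeping the longer word is just one possible "predetermined criterion" (the claims also allow, for example, keeping the shorter word or comparing degrees of importance).

```python
def within_range(word, containing_texts, low, high):
    """Range test: is the number of texts containing the word inside [low, high]?"""
    return low <= containing_texts.get(word, 0) <= high

def drop_included_words(sorted_words, query_keywords=()):
    """Resolve inclusion relations by keeping the longer word (assumed criterion).

    A word is dropped when it is a proper substring of another sorted word
    or of a keyword already used in the query. Order is preserved.
    """
    pool = list(sorted_words) + list(query_keywords)
    kept = []
    for word in sorted_words:
        included = any(word != other and word in other for other in pool)
        if not included:
            kept.append(word)
    return kept

# Hypothetical containing text counts: very common and very rare words fail.
containing_texts = {"the": 980, "retrieval": 40, "zyzzyva": 1}
candidates = [w for w in containing_texts if within_range(w, containing_texts, 2, 500)]

pruned = drop_included_words(["text retrieval", "retrieval", "keyword", "text"])
pruned_with_query = drop_included_words(
    ["text retrieval", "retrieval", "keyword", "text"],
    query_keywords=["keyword extraction"],
)
```

The range test removes words that appear in almost every text or almost none, and the inclusion step keeps the displayed list from showing near-duplicates such as "retrieval" next to "text retrieval".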

Current Assignee: Panasonic Intellectual Property Corporation of America (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Nomoto, Masako; Sato, Mitsuhiro; Noguchi, Naohiko; Kanno, Yuji; Inaba, Mitsuaki; Fukushige, Yoshio
Primary Examiner(s): Kulik, Paul V.
Application Number: US09/106,748
Time in Patent Office: 1,008 Days
Field of Search: 707/2, 707/3, 707/4-6
US Class Current: 707/749
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
Y10S 707/917   Text
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
