Information management and retrieval
Abstract
A method and apparatus are provided for extracting key terms from a data set. The method includes identifying a first set of one or more word groups, each of one or more words, that occur more than once in the data set, and removing from this first set a second set of word groups that are sub-strings of longer word groups in the first set. The remaining word groups are key terms. Each word group is weighted according to its frequency of occurrence within the data set. The weighting of any word group may be increased by the frequency of any sub-string of words occurring in the second set, with each weighting then divided by the number of words in the word group. This weighting process operates to determine the order of occurrence of the word groups. Prefixes and suffixes are also removed from each word in the data set. This produces a neutral form of each word so that the weighting values are prefix and suffix independent.
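The extraction step described in the abstract can be illustrated with a minimal Python sketch. It is only one possible reading, not the patented implementation: the function names, the whitespace tokenisation and the `max_len` cap on word-group length are assumptions introduced here for illustration.

```python
from collections import Counter


def repeated_word_groups(words, max_len=4):
    """Count every run of 1..max_len consecutive words and keep the runs
    that occur more than once (the "first set"). max_len is an assumed cap."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {group: c for group, c in counts.items() if c > 1}


def is_sub_string(short, long):
    """True if `short` is a contiguous run of words inside the strictly longer group `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def key_terms(words, max_len=4):
    """Remove from the first set every group that is a sub-string of a longer
    group in that set; the remaining groups are the key terms."""
    first_set = repeated_word_groups(words, max_len)
    sub_set = {
        g for g in first_set
        if any(is_sub_string(g, other) for other in first_set)
    }
    return {g: c for g, c in first_set.items() if g not in sub_set}


words = "the data set holds key terms and the data set holds other key terms".split()
print(sorted(key_terms(words)))
# [('key', 'terms'), ('the', 'data', 'set', 'holds')]
```

In this toy input every repeated single word and shorter phrase is absorbed by a longer repeated phrase, so only the two longest repeated groups survive.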
136 Citations
26 Claims
-
1. Apparatus for managing data sets, having:
-
input means for receiving a data set;
means to identify, within a received data set, a first set of words comprising one or more word groups of one or more words, conforming to a first predetermined distribution pattern within said received data set, wherein said words in said word groups occur consecutively in said received data set;
means to identify, within said first set, a sub-set of words comprising one or more of said word groups, conforming to a second predetermined distribution pattern within said received data set;
means to eliminate said sub-set of words from said first set thereby forming a set of key terms of said received data set; and
output means for outputting at least one said key term.
means for modifying said word groups, arranged to remove low value words occurring before the first high value word in a word group and arranged to remove low value words occurring after the last high value word in a word group.
-
-
5. Apparatus as in claim 4 including:
means for modifying any word in any word group, arranged to remove any prefix and arranged to remove any suffix from a word to form a stemmed word.
-
6. Apparatus as in claim 5 including:
means for storing said prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
-
7. Apparatus as in claim 1 wherein said second distribution pattern requires that each word group in said sub-set comprises a word or a string of words that occurs within a larger word group in said first set.
-
8. Apparatus as in claim 7 including:
-
means for weighting each said word group in said first set according to how frequently each said word group occurs in said received data set;
means for modifying said weighting of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set; and
means for selecting said key terms for output in dependence upon said weightings.
-
-
9. Apparatus as in claim 8 further comprising:
means for selecting key terms for output in dependence upon said weightings and at least one predetermined rule.
-
10. Apparatus as in claim 1 including:
means for modifying said word groups, arranged to remove low value words occurring before the first high value word in a word group and arranged to remove low value words occurring after the last high value word in a word group.
-
11. Apparatus as in claim 1 including:
means for modifying any word in any word group, arranged to remove any prefix and arranged to remove any suffix from a word to form a stemmed word.
-
12. Apparatus as in claim 11 including:
means for storing said prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
-
13. Apparatus as in claim 1 including:
-
means for weighting each said word group in said first set according to how frequently each said word group occurs in said received data set;
means for modifying said weighting of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set; and
means for selecting said key terms for output in dependence upon said weightings.
-
-
14. Apparatus as in claim 1 further comprising:
means for selecting key terms for output in dependence upon said weightings and at least one predetermined rule.
-
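The word-group modifications recited in the dependent apparatus claims (trimming low value words from either end of a group, and stemming with the removed prefix or suffix stored for later restoration) might be sketched in Python as follows. The `LOW_VALUE`, `PREFIXES` and `SUFFIXES` lists and all function names are hypothetical; the claims do not specify which words count as low value or which affixes are stripped, and this crude affix stripper is not a real stemming algorithm.

```python
# Hypothetical word lists, chosen only for illustration.
LOW_VALUE = {"the", "of", "a", "an", "and", "in", "to"}
PREFIXES = ("re", "un")
SUFFIXES = ("ing", "ed", "s")


def trim_low_value(group):
    """Drop low value words before the first and after the last high value
    word; low value words in the interior of the group are kept."""
    start, end = 0, len(group)
    while start < end and group[start] in LOW_VALUE:
        start += 1
    while end > start and group[end - 1] in LOW_VALUE:
        end -= 1
    return group[start:end]


def stem(word):
    """Return (stemmed_word, prefix, suffix). The removed affixes are returned
    alongside the stem so that the original word can be rebuilt later."""
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2), ""
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2), ""
    )
    stemmed = rest[:len(rest) - len(suffix)] if suffix else rest
    return stemmed, prefix, suffix


def restore(stemmed, prefix, suffix):
    """Re-attach the stored prefix and suffix to the stemmed word."""
    return prefix + stemmed + suffix


print(trim_low_value(("the", "retrieval", "of", "information", "in")))
# ('retrieval', 'of', 'information')
print(stem("reindexing"))
# ('index', 're', 'ing')
print(restore(*stem("reindexing")))
# reindexing
```

Returning the affixes together with the stem is one simple way to keep them "in association with" the stemmed word; a lookup table keyed by position in the data set would serve the same purpose.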
15. A method of managing data sets, said method including:
-
1) receiving a data set as input;
2) identifying a first set of words conforming to a first distribution pattern within said data set, said first set comprising one or more word groups of one or more words, wherein said words in said word groups occur consecutively in said data set;
3) identifying a sub-set of word groups in said first set, said sub-set conforming to a second distribution pattern within said data-set;
4) eliminating said sub-set from said first set thereby identifying a set of key terms;
5) outputting said key terms.
6) removing any low value word occurring before the first high value word in a word group and removing any low value word occurring after the last high value word in a word group.
-
-
19. A method as in claim 18 including:
7) modifying any word in any said word group by removing a prefix or suffix from the word thereby forming a stemmed word.
-
20. A method as in claim 19 including:
8) storing said removed prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
-
21. A method as in claim 20 including the steps of:
-
9) weighting each word group in said first set according to how frequently each said word group occurs in said data set;
10) modifying said weightings of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set;
11) selecting said key terms for output in dependence upon said weightings.
-
-
22. A method as in claim 15 wherein said second distribution pattern requires that each said word group of said sub-set comprises a sub-string of a longer word group in said first set.
-
24. A method as in claim 15, including:
7) modifying any word in any said word group by removing a prefix or suffix from the word thereby forming a stemmed word.
-
25. A method as in claim 24, including:
8) storing said removed prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
-
26. A method as in claim 15, including the steps of:
-
9) weighting each word group in said first set according to how frequently each said word group occurs in said data set;
10) modifying said weightings of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set;
11) selecting said key terms for output in dependence upon said weightings.
-
-
23. A method as in claims including:
6) removing any low value word occurring before the first high value word in a word group and removing any low value word occurring after the last high value word in a word group.
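The weighting and selection steps in the method claims above could be read along the following lines. This Python sketch borrows the rule given in the abstract (add the frequency of each eliminated sub-string, then divide by the number of words) as one way of modifying a weighting "in proportion to" that of a group in the sub-set; other readings are possible. The function names, the input format and the `top_n` cut-off are assumptions made for illustration.

```python
def is_sub_string(short, long):
    """True if `short` is a contiguous run of words inside the strictly longer group `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def weight_key_terms(first_set):
    """first_set maps each word group (a tuple of words) to its frequency.
    Returns a weight for every group that survives sub-string elimination."""
    sub_set = {
        g for g in first_set
        if any(is_sub_string(g, other) for other in first_set)
    }
    weights = {}
    for group, freq in first_set.items():
        if group in sub_set:
            continue  # eliminated groups are not key terms
        # Add the frequency of every eliminated sub-string of this group.
        boost = sum(first_set[s] for s in sub_set if is_sub_string(s, group))
        # Normalise by the number of words in the group.
        weights[group] = (freq + boost) / len(group)
    return weights


def select_key_terms(weights, top_n=1):
    """Output the top_n key terms, highest weight first (ties broken alphabetically)."""
    return sorted(weights, key=lambda g: (-weights[g], g))[:top_n]


first_set = {("key",): 5, ("terms",): 2, ("key", "terms"): 2, ("data", "set"): 3}
weights = weight_key_terms(first_set)
print(weights[("key", "terms")])   # (2 + 5 + 2) / 2 = 4.5
print(weights[("data", "set")])    # 3 / 2 = 1.5
print(select_key_terms(weights))   # [('key', 'terms')]
```

A fixed `top_n` is only one example of a selection rule; a minimum-weight threshold would be an equally simple alternative.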
Specification