Information management and retrieval

US 6,338,057 B1
Filed: 12/07/1998
Issued: 01/08/2002
Est. Priority Date: 11/24/1997
Status: Expired due to Term

Abstract

A method and apparatus are provided for extracting key terms from a data set. The method includes identifying a first set of one or more word groups, each of one or more words, that occur more than once in the data set, and removing from this first set a second set of word groups that are sub-strings of longer word groups in the first set. The remaining word groups are key terms. Each word group is weighted according to its frequency of occurrence within the data set. The weighting of any word group may be increased by the frequency of any sub-string of words occurring in the second set, and each weighting may then be divided by the number of words in the word group. This weighting process operates to determine the order of occurrence of the word groups. Prefixes and suffixes are also removed from each word in the data set. This produces a neutral form of each word so that the weighting values are prefix and suffix independent.
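
The extraction step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `max_len` cap on word-group length, and the whitespace tokenisation are all assumptions introduced here.

```python
from collections import Counter

def repeated_word_groups(words, max_len=4):
    """Count every run of 1..max_len consecutive words; keep runs seen at least twice.

    max_len is a hypothetical cap, not something the abstract specifies.
    """
    counts = Counter(
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    )
    return {group: c for group, c in counts.items() if c >= 2}

def is_sub_string(short, long):
    """True if `short` occurs as a contiguous run of words inside a longer group."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def key_terms(words, max_len=4):
    """First set minus the groups that are sub-strings of longer groups in that set."""
    first_set = repeated_word_groups(words, max_len)
    sub_set = {
        g for g in first_set
        if any(is_sub_string(g, other) for other in first_set)
    }
    return {g: c for g, c in first_set.items() if g not in sub_set}

text = "key term extraction and key term weighting use word groups and word groups"
print(key_terms(text.split()))
```

With this input the surviving groups are "key term", "word groups", and the single word "and"; the last of these suggests why a separate treatment of low value words is useful.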

136 Citations

26 Claims

1. Apparatus for managing data sets, having:
- input means for receiving a data set;
  
  means to identify, within a received data set, a first set of words comprising one or more word groups of one or more words, conforming to a first predetermined distribution pattern within said received data set, wherein said words in said word groups occur consecutively in said received data set;
  
  means to identify, within said first set, a sub-set of words comprising one or more of said word groups, conforming to a second predetermined distribution pattern within said received data set;
  
  means to eliminate said sub-set of words from said first set thereby forming a set of key terms of said received data set; and
  
  output means for outputting at least one said key term.
2. Apparatus as in claim 1 wherein said first distribution pattern requires that each word group in said first set occurs at least twice in said received data set.
3. Apparatus as in claim 2 wherein said second distribution pattern requires that each word group in said sub-set comprises a word or a string of words that occurs within a larger word group in said first set.
4. Apparatus as in claim 3 including:
5. Apparatus as in claim 4 including:
- means for modifying any word in any word group, arranged to remove any prefix and arranged to remove any suffix from a word to form a stemmed word.
6. Apparatus as in claim 5 including:
- means for storing said prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
7. Apparatus as in claim 1 wherein said second distribution pattern requires that each word group in said sub-set comprises a word or a string of words that occurs within a larger word group in said first set.
8. Apparatus as in claim 7 including:
- means for weighting each said word group in said first set according to how frequently each said word group occurs in said received data set;
  
  means for modifying said weighting of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set; and
  
  means for selecting said key terms for output in dependence upon said weightings.
9. Apparatus as in claim 8 further comprising:
- means for selecting key terms for output in dependence upon said weightings and at least one predetermined rule.
10. Apparatus as in claim 1 including:
- means for modifying said word groups, arranged to remove low value words occurring before the first high value word in a word group and arranged to remove low value words occurring after the last high value word in a word group.
11. Apparatus as in claim 1 including:
- means for modifying any word in any word group, arranged to remove any prefix and arranged to remove any suffix from a word to form a stemmed word.
12. Apparatus as in claim 11 including:
- means for storing said prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
13. Apparatus as in claim 1 including:
- means for weighting each said word group in said first set according to how frequently each said word group occurs in said received data set;
  
  means for modifying said weighting of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set; and
  
  means for selecting said key terms for output in dependence upon said weightings.
14. Apparatus as in claim 1 further comprising:
- means for selecting key terms for output in dependence upon said weightings and at least one predetermined rule.
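
The stemming and affix storage recited in claims 5, 6, 11 and 12 might look like the sketch below. The prefix and suffix lists, the `min_stem` length guard, and the function names are hypothetical and chosen only for illustration; the claims do not specify them.

```python
PREFIXES = ("un", "re")        # hypothetical, illustrative only
SUFFIXES = ("ing", "ed", "s")  # hypothetical, illustrative only

def stem(word, min_stem=3):
    """Strip at most one listed prefix and one listed suffix.

    Returns (stemmed_word, prefix, suffix) so the affixes stay associated
    with the stem. min_stem is an assumed guard against over-stripping.
    """
    prefix = next(
        (p for p in PREFIXES
         if word.startswith(p) and len(word) - len(p) >= min_stem),
        "",
    )
    rest = word[len(prefix):]
    suffix = next(
        (s for s in SUFFIXES
         if rest.endswith(s) and len(rest) - len(s) >= min_stem),
        "",
    )
    core = rest[:len(rest) - len(suffix)]
    return core, prefix, suffix

def restore(core, prefix, suffix):
    """Rebuild the original word from the stem and its stored affixes."""
    return prefix + core + suffix

for word in ["reindexing", "indexed", "index"]:
    core, prefix, suffix = stem(word)
    print(word, "->", core, "| restored:", restore(core, prefix, suffix))
```

All three example words reduce to the same stem, so their frequencies can be pooled, while the stored affixes allow each original form to be recovered for output.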

15. A method of managing data sets, said method including:
- 1) receiving a data set as input;
  
  2) identifying a first set of words conforming to a first distribution pattern within said data set, said first set comprising one or more word groups of one or more words, wherein said words in said word groups occur consecutively in said data set;
  
  3) identifying a sub-set of word groups in said first set, said sub-set conforming to a second distribution pattern within said data-set;
  
  4) eliminating said sub-set from said first set thereby identifying a set of key terms;
  
  5) outputting said key terms.
16. A method as in claim 15 wherein said first distribution pattern requires that each said word group in said first set occurs more than once in said data set.
17. A method as in claim 16 wherein said second distribution pattern requires that each said word group of said sub-set comprises a sub-string of a longer word group in said first set.
18. A method as in claim 17 including:
19. A method as in claim 18 including:
- 7) modifying any word in any said word group by removing a prefix or suffix from the word thereby forming a stemmed word.
20. A method as in claim 19 including:
- 8) storing said removed prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
21. A method as in claim 20 including the steps of:
- 9) weighting each word group in said first set according to how frequently each said word group occurs in said data set;
  
  10) modifying said weightings of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set;
  
  11) selecting said key terms for output in dependence upon said weightings.
22. A method as in claim 15 wherein said second distribution pattern requires that each said word group of said sub-set comprises a sub-string of a longer word group in said first set.
24. A method as in claim 15, including:
- 7) modifying any word in any said word group by removing a prefix or suffix from the word thereby forming a stemmed word.
25. A method as in claim 24, including:
- 8) storing said removed prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
26. A method as in claim 15, including the steps of:
- 9) weighting each word group in said first set according to how frequently each said word group occurs in said data set;
  
  10) modifying said weightings of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set;
  
  11) selecting said key terms for output in dependence upon said weightings.
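
One possible reading of weighting steps 9) to 11), combined with the abstract, is sketched below: each key term starts from its own frequency, gains the frequency of every eliminated sub-string it contains, and is then divided by its word count. The exact formula, the function names, and the sample counts are assumptions; the claims only require that the modification be in proportion to a sub-set weighting.

```python
def contains(long, short):
    """True if `short` is a contiguous run of words inside the longer group `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )

def weigh_key_terms(counts):
    """Rank key terms by (own frequency + contained sub-string frequencies) / length.

    `counts` maps each repeated word group (a tuple of words) to its frequency.
    """
    sub_set = {g for g in counts if any(contains(other, g) for other in counts)}
    weights = {}
    for group, freq in counts.items():
        if group in sub_set:
            continue  # eliminated groups are not key terms themselves
        boost = sum(counts[s] for s in sub_set if contains(group, s))
        weights[group] = (freq + boost) / len(group)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical frequencies for illustration.
counts = {("key",): 3, ("term",): 3, ("key", "term"): 2, ("weight",): 2}
print(weigh_key_terms(counts))
```

Here "key term" receives (2 + 3 + 3) / 2 = 4.0 and "weight" receives 2 / 1 = 2.0, so "key term" would be selected first.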

23. A method as in claims including:
- 6) removing any low value word occurring before the first high value word in a word group and removing any low value word occurring after the last high value word in a word group.
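
Step 6) can be illustrated with a short sketch. The low value word list and the function name are hypothetical; the claim does not say how low value words are identified.

```python
LOW_VALUE = {"the", "of", "and", "a", "in", "to"}  # hypothetical list

def trim_low_value(group):
    """Drop low value words before the first and after the last high value word.

    Low value words between two high value words are kept.
    """
    high = [i for i, w in enumerate(group) if w.lower() not in LOW_VALUE]
    if not high:
        return ()  # nothing but low value words
    return tuple(group[high[0]:high[-1] + 1])

print(trim_low_value(("the", "bank", "of", "england", "and")))
```

The interior "of" survives because it lies between two high value words, so the group becomes "bank of england".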

Current Assignee: British Telecommunications PLC (BT Group PLC)
Original Assignee: British Telecommunications PLC (BT Group PLC)
Inventors: Weeks, Richard
Primary Examiner(s): Black, Thomas
Assistant Examiner(s): CHEUNG, MARY DA ZHI WANG
Application Number: US09/194,944
Time in Patent Office: 1,128 Days
Field of Search: 707/3, 707/4, 707/5, 707/6, 704/1, 704/2, 704/3, 704/4, 704/7, 704/10
US Class Current: 707/736
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
Y10S 707/99933 Query processing, i.e. sear...
