System and method for automatically extracting interesting phrases in a large dynamic corpus
Abstract
A phrase extraction system combines a dictionary method, a statistical/heuristic approach, and a set of pruning steps to extract frequently occurring and interesting phrases from a corpus. The system finds the “top k” phrases in a corpus, where k is an adjustable parameter. For a time-varying corpus, such as an ongoing, long-term document feed or a continuous, regular web crawl, the system uses historical statistics to extract new and increasingly frequent phrases. The system also finds interesting phrases that occur near a set of user-designated phrases, using the designated phrases as anchor phrases.
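As a rough illustration of how historical statistics could surface new and increasingly frequent phrases, the sketch below ranks phrases by the ratio of their current count to their historical count. The function name `rising_phrases`, the additive `smoothing` term, and the sample counts are all assumptions made for this example; the abstract does not specify a formula.

```python
from collections import Counter

def rising_phrases(current, history, k=3, smoothing=1.0):
    """Return the k phrases whose counts grew most relative to history.

    `current` and `history` map phrase -> occurrence count. The smoothed
    ratio is an illustrative scoring choice, not a formula from the source.
    """
    scores = {
        phrase: (count + smoothing) / (history.get(phrase, 0) + smoothing)
        for phrase, count in current.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:k]

# Made-up counts: one phrase is new, one is stable, one is declining.
history = Counter({"web crawl": 40, "document feed": 35})
current = Counter({"web crawl": 42, "document feed": 30, "phrase extraction": 12})
top = rising_phrases(current, history, k=2)
```

With these made-up counts, the previously unseen phrase ranks first because it has no history to divide by, and the smoothing term keeps that division defined.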
20 Claims
1. A method of automatically extracting a plurality of interesting phrases in a corpus, comprising:
generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, combining the tokens into compound tokens as directed by the dictionary;
forming candidate N-token phrases from the tokens and the compound tokens;
accumulating an occurrence count for at least some of the candidate N-token phrases;
pruning the candidate N-token phrases by applying a pruning threshold;
merging overlapping candidate N-token phrases;
adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
ordering the candidate N-token phrases according to a score, and selecting the interesting phrases as the highest ranking candidate N-token phrases.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
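The first four steps of claim 1 (tokenizing with abbreviation expansion, combining compound tokens, forming and counting candidate N-token phrases, and pruning by a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary contents, the function names, the regular expression, and the default parameters are hypothetical, and the claim does not prescribe any of them.

```python
import re
from collections import Counter

# Hypothetical dictionary entries, chosen only for illustration.
ABBREVIATIONS = {"ibm": ["international", "business", "machines"]}
COMPOUNDS = {("new", "york"): "new_york"}

def tokenize(text):
    """Lowercase the text, split it into tokens, and expand abbreviations."""
    tokens = []
    for raw in re.findall(r"[a-z0-9']+", text.lower()):
        tokens.extend(ABBREVIATIONS.get(raw, [raw]))
    return tokens

def combine_compounds(tokens):
    """Replace adjacent token pairs found in COMPOUNDS with one compound token."""
    combined, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            combined.append(COMPOUNDS[pair])
            i += 2
        else:
            combined.append(tokens[i])
            i += 1
    return combined

def pruned_candidates(text, threshold=2, min_n=2, max_n=3):
    """Count every N-token phrase and keep those meeting the threshold.

    For brevity, phrases are allowed to span sentence boundaries here.
    """
    tokens = combine_compounds(tokenize(text))
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= threshold}

text = ("IBM opened a New York office. The New York office hired staff. "
        "IBM staff moved.")
kept = pruned_candidates(text)
```

In this example the surviving candidates still include both "international business machines" and its two-token sub-phrases, which is the kind of redundancy the merging and count-adjusting steps of the claim are meant to address.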
16. A computer program product comprising a computer usable medium having computer usable program codes for automatically extracting a plurality of interesting phrases in a corpus, the computer program product comprising:
computer usable program code for generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, computer usable program code for combining the tokens into compound tokens as directed by the dictionary;
computer usable program code for forming candidate N-token phrases from the tokens and the compound tokens;
computer usable program code for accumulating an occurrence count for at least some of the candidate N-token phrases;
computer usable program code for pruning the candidate N-token phrases by applying a pruning threshold;
computer usable program code for merging overlapping candidate N-token phrases;
computer usable program code for adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
computer usable program code for ordering the candidate N-token phrases according to a score, and selecting the interesting phrases as the highest ranking candidate N-token phrases.
Dependent claims: 17, 18, 19
20. A system for automatically extracting a plurality of interesting phrases in a corpus, comprising:
a tokenizer for generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, a token combiner for combining the tokens into compound tokens as directed by the dictionary;
a token phrase counter for forming candidate N-token phrases from the tokens and the compound tokens, and for accumulating an occurrence count for at least some of the candidate N-token phrases;
a pruner for pruning the candidate N-token phrases by applying a pruning threshold;
a merger for merging overlapping candidate N-token phrases;
a count adjuster for adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
a phrase selector for ordering the candidate N-token phrases according to a score, and for selecting the interesting phrases as the highest ranking candidate N-token phrases.
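The merger, count adjuster, and phrase selector elements could be approximated as below. Everything here is an assumption for illustration: the equal-count rule for merging, the suffix stripping used to detect plurals and possessives, the score (count multiplied by phrase length), and all function names are hypothetical stand-ins, since the claim names the components without defining their logic.

```python
from collections import Counter

def merge_overlapping(counts):
    """Add (a, b, c) when (a, b) and (b, c) overlap and have equal counts."""
    merged = dict(counts)
    for left, left_count in counts.items():
        for right, right_count in counts.items():
            overlaps = (len(left) == len(right) and len(left) > 1
                        and left[1:] == right[:-1])
            if left != right and overlaps and left_count == right_count:
                merged.setdefault(left + right[-1:], left_count)
    return merged

def base_form(phrase):
    """Strip a possessive or simple plural ending from the last token."""
    last = phrase[-1]
    for suffix in ("'s", "s"):
        if last.endswith(suffix) and len(last) > len(suffix):
            return phrase[:-1] + (last[:-len(suffix)],)
    return phrase

def fold_variants(counts):
    """Credit a plural or possessive to its base phrase when that base exists."""
    folded = Counter()
    for phrase, count in counts.items():
        base = base_form(phrase)
        folded[base if base in counts else phrase] += count
    return folded

def discount_subphrases(counts):
    """Subtract each longer phrase's count from the sub-phrases it contains."""
    adjusted = dict(counts)
    for phrase in sorted(counts, key=len, reverse=True):
        n = len(phrase)
        for size in range(1, n):
            for i in range(n - size + 1):
                sub = phrase[i:i + size]
                if sub in adjusted:
                    adjusted[sub] = max(0, adjusted[sub] - adjusted[phrase])
    return adjusted

def select_phrases(counts, k=2):
    """Merge, adjust counts, then return the k highest-scoring phrases."""
    counts = discount_subphrases(fold_variants(merge_overlapping(counts)))
    # Illustrative score: adjusted count weighted by phrase length.
    ranked = sorted(counts, key=lambda p: (-counts[p] * len(p), p))
    return [" ".join(p) for p in ranked[:k] if counts[p] > 0]

candidates = {
    ("international", "business"): 2,
    ("business", "machines"): 2,
    ("new_york", "office"): 2,
    ("new_york", "offices"): 1,
    ("new_york", "office's"): 1,
    ("office", "hired"): 1,
}
selected = select_phrases(candidates, k=2)
```

Folding only when the base phrase is itself a candidate keeps words such as "business" from being truncated, at the cost of missing variants whose base form never appears in the corpus.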
Specification