System and method for automatically extracting interesting phrases in a large dynamic corpus

  • US 20070067157A1
  • Filed: 09/22/2005
  • Published: 03/22/2007
  • Est. Priority Date: 09/22/2005
  • Status: Abandoned Application
First Claim

1. A method of automatically extracting a plurality of interesting phrases in a corpus, comprising:

  • generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, combining the tokens into compound tokens as directed by the dictionary;

  • forming candidate N-token phrases from the tokens and the compound tokens;

  • accumulating an occurrence count for at least some of the candidate N-token phrases;

  • pruning the candidate N-token phrases by applying a pruning threshold;

  • merging overlapping candidate N-token phrases;

  • adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and

  • ordering the candidate N-token phrases according to a score, and selecting the interesting phrases as the highest ranking candidate N-token phrases.
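
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The claim does not specify the dictionary contents, the merge rule, the count adjustments, or the score, so everything below is an assumption chosen for illustration: the `ABBREVIATIONS` and `COMPOUNDS` entries are hypothetical, merging simply joins the longest surviving phrases whose tail and head coincide, plural and possessive handling is a naive suffix check, the sub-phrase adjustment subtracts the count of the most frequent longer phrase containing a candidate, and the score is the adjusted count multiplied by phrase length.

```python
import re
from collections import Counter

# Hypothetical dictionary entries, for illustration only.
ABBREVIATIONS = {"ml": ["machine", "learning"]}
COMPOUNDS = {("new", "york"): "new_york"}

def tokenize(text):
    """Lowercase, split into tokens, expand abbreviations, join compounds."""
    raw = re.findall(r"[a-z0-9]+(?:'s)?", text.lower())
    expanded = []
    for tok in raw:
        expanded.extend(ABBREVIATIONS.get(tok, [tok]))
    tokens, i = [], 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in COMPOUNDS:          # two-token compounds only, for brevity
            tokens.append(COMPOUNDS[pair])
            i += 2
        else:
            tokens.append(expanded[i])
            i += 1
    return tokens

def candidate_counts(docs, max_n=3):
    """Count every candidate phrase of 2..max_n tokens."""
    counts = Counter()
    for doc in docs:
        toks = tokenize(doc)
        for n in range(2, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return counts

def prune(counts, threshold):
    """Drop candidates seen fewer than `threshold` times."""
    return Counter({p: c for p, c in counts.items() if c >= threshold})

def merge_overlapping(counts):
    """Naively join the longest phrases whose tail equals another's head."""
    merged = Counter(counts)
    longest = max((len(p) for p in counts), default=0)
    for a, ca in counts.items():
        for b, cb in counts.items():
            if len(a) == len(b) == longest and a != b and a[1:] == b[:-1]:
                merged[a + b[-1:]] = min(ca, cb)   # assumed count estimate
    return merged

def fold_variants(counts):
    """Add counts of naive possessive/plural variants to the base phrase."""
    folded = Counter()
    for phrase, c in counts.items():
        last = phrase[-1]
        for suffix in ("'s", "s"):
            base = phrase[:-1] + (last[:-len(suffix)],)
            if last.endswith(suffix) and base in counts:
                phrase = base
                break
        folded[phrase] += c
    return folded

def contains(longer, phrase):
    n = len(phrase)
    return any(longer[i:i + n] == phrase for i in range(len(longer) - n + 1))

def discount_subphrases(counts):
    """Subtract the count of the most frequent longer phrase containing each phrase."""
    adjusted = Counter()
    for phrase, c in counts.items():
        covered = max((lc for longer, lc in counts.items()
                       if len(longer) > len(phrase) and contains(longer, phrase)),
                      default=0)
        adjusted[phrase] = c - covered
    return adjusted

def extract_phrases(docs, threshold=2, top_k=1, max_n=3):
    """Run the steps in claim order and return the top_k phrases."""
    counts = prune(candidate_counts(docs, max_n), threshold)
    counts = merge_overlapping(counts)
    counts = discount_subphrases(fold_variants(counts))
    # Assumed score: adjusted count times phrase length; ties broken alphabetically.
    ranked = sorted(counts, key=lambda p: (-counts[p] * len(p), p))
    return [" ".join(p) for p in ranked[:top_k]]
```

Because pruning runs before the count adjustment, as in the claim's ordering, a plural or possessive variant is only folded into its base phrase if the variant itself survives the threshold. A production system would also need to confirm that a merged phrase actually occurs in the corpus, which this sketch does not do.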
