System and method for automatically extracting interesting phrases in a large dynamic corpus

US 20070067157A1
Filed: 09/22/2005
Published: 03/22/2007
Est. Priority Date: 09/22/2005
Status: Abandoned Application

Abstract

A phrase extraction system combines a dictionary method, a statistical/heuristic approach, and a set of pruning steps to extract frequently occurring and interesting phrases from a corpus. The system finds the “top k” phrases in a corpus, where k is an adjustable parameter. For a time-varying corpus, the system uses historical statistics to extract new and increasingly frequent phrases. The system also finds interesting phrases that occur near a set of user-designated phrases, using the designated phrases as anchor phrases. The system further finds frequently occurring and interesting phrases in a corpus that is changing in time, as in an ongoing, long-term document feed or a continuous, regular web crawl.
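
The overall flow can be illustrated with a minimal Python sketch. It is not the applicants' implementation. It assumes a simple regex tokenizer, two-token compounds only, a naive plural/possessive fold, and a sub-phrase adjustment that discounts a phrase by the largest count of any retained phrase one token longer that contains it. The merging of overlapping phrases is omitted for brevity. All function names, parameters, and dictionary entries are hypothetical.

```python
import re
from collections import Counter


def tokenize(text, abbreviations, compounds):
    """Lowercase, split into tokens, expand abbreviations, join compounds."""
    raw = re.findall(r"[a-z0-9]+(?:'s)?", text.lower())
    expanded = []
    for tok in raw:
        # An abbreviation maps to a list of replacement tokens (assumed format).
        expanded.extend(abbreviations.get(tok, [tok]))
    tokens, i = [], 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in compounds:  # two adjacent tokens become one compound token
            tokens.append(compounds[pair])
            i += 2
        else:
            tokens.append(expanded[i])
            i += 1
    return tokens


def normalize(token):
    """Fold possessives and simple plurals onto one form (naive heuristic)."""
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token


def top_k_phrases(text, k, max_n=3, threshold=2,
                  abbreviations=None, compounds=None):
    """Return up to k (phrase, adjusted_count) pairs, highest count first."""
    tokens = [normalize(t)
              for t in tokenize(text, abbreviations or {}, compounds or {})]
    # Candidate N-token phrases with their occurrence counts.
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )
    # Prune candidates that fall below the threshold.
    kept = {p: c for p, c in counts.items() if c >= threshold}
    # Sub-phrase adjustment (assumed rule): discount each phrase by the
    # largest count of a retained phrase one token longer that contains it.
    covered = Counter()
    for phrase, c in kept.items():
        if len(phrase) > 1:
            for sub in (phrase[:-1], phrase[1:]):
                covered[sub] = max(covered[sub], c)
    adjusted = [(p, c - covered[p]) for p, c in kept.items() if c > covered[p]]
    # Rank by adjusted count, preferring longer phrases on ties.
    adjusted.sort(key=lambda pc: (-pc[1], -len(pc[0]), pc[0]))
    return [(" ".join(p), c) for p, c in adjusted[:k]]
```

With the input `"solar panel cost. solar panels cost less. wind turbine. solar panel's cost"`, the plural and possessive forms fold onto `panel`, and the shorter phrases are fully absorbed by the three-token phrase that contains them.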

20 Claims

1. A method of automatically extracting a plurality of interesting phrases in a corpus, comprising:
- generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, combining the tokens into compound tokens as directed by the dictionary;
  
  forming candidate N-token phrases from the tokens and the compound tokens;
  
  accumulating an occurrence count for at least some of the candidate N-token phrases;
  
  pruning the candidate N-token phrases by applying a pruning threshold;
  
  merging overlapping candidate N-token phrases;
  
  adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
  
  ordering the candidate N-token phrases according to a score, and selecting the interesting phrases as the highest ranking candidate N-token phrases.
- Dependent claims (2 to 15):
  - 2. The method of claim 1, wherein the corpus is static.
  - 3. The method of claim 2, wherein the score includes an occurrence count of the candidate N-token phrases.
  - 4. The method of claim 1, wherein the corpus is time-variable.
  - 5. The method of claim 4, wherein the score includes an occurrence count of the candidate N-token phrases, which is determined over preceding n intervals of time.
  - 6. The method of claim 1, further comprising:
    - selecting anchor phrases; and
      
      identifying anchor tokens corresponding to the selected anchor phrases.
  - 7. The method of claim 6, further comprising disambiguating the anchor tokens by identifying desired anchor tokens through context.
  - 8. The method of claim 6, wherein forming the candidate N-token phrases comprises forming the candidate N-token phrases within a predetermined vicinity of an anchor phrase using anchor tokens as a delimiter.
  - 9. The method of claim 8, wherein the vicinity of the anchor phrase comprises a predetermined window.
  - 10. The method of claim 8, wherein the vicinity of the anchor phrase comprises a sentence.
  - 11. The method of claim 8, wherein the vicinity of the anchor phrase comprises a paragraph.
  - 12. The method of claim 8, wherein the vicinity of the anchor phrase comprises a markup tag.
  - 13. The method of claim 8, wherein accumulating the occurrence count comprises accumulating a local occurrence count for each candidate N-token phrase occurring within the vicinity of the anchor token.
  - 14. The method of claim 13, further comprising computing a global occurrence count for candidate N-token phrases over the corpus.
  - 15. The method of claim 14, wherein the score comprises the local occurrence count and the global occurrence count.
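
The anchor-phrase variant in claims 6 to 15 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the anchor is a single token, the vicinity is a fixed window of tokens on each side, and the score is the ratio of local to global count. The claims say only that the score comprises both counts, so the ratio is a hypothetical choice, as are all names below.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every phrase of 1 to max_n tokens in a token list."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def runs(tokens, keep):
    """Split tokens into maximal runs of positions where keep[i] is true."""
    run = []
    for tok, ok in zip(tokens, keep):
        if ok:
            run.append(tok)
        elif run:
            yield run
            run = []
    if run:
        yield run


def anchor_scores(tokens, anchor, window=3, max_n=2):
    """Map each phrase near the anchor to (local, global, local / global)."""
    # Anchor tokens act as delimiters: no phrase is formed across them.
    not_anchor = [t != anchor for t in tokens]
    # Mark every position within `window` tokens of some anchor occurrence.
    near = [False] * len(tokens)
    for pos, tok in enumerate(tokens):
        if tok == anchor:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                near[j] = True
    # Global counts cover the whole corpus; local counts only the vicinity.
    global_counts = Counter(
        p for r in runs(tokens, not_anchor) for p in ngrams(r, max_n))
    local_keep = [a and b for a, b in zip(not_anchor, near)]
    local_counts = Counter(
        p for r in runs(tokens, local_keep) for p in ngrams(r, max_n))
    return {
        p: (local_counts[p], global_counts[p],
            local_counts[p] / global_counts[p])
        for p in local_counts
    }
```

Marking positions first, rather than counting per anchor occurrence, keeps a phrase that lies between two nearby anchors from being counted twice, so the local count never exceeds the global count.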

16. A computer program product comprising a computer usable medium having computer usable program codes for automatically extracting a plurality of interesting phrases in a corpus, the computer program product comprising:
- computer usable program code for generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, computer usable program code for combining the tokens into compound tokens as directed by the dictionary;
  
  computer usable program code for forming candidate N-token phrases from the tokens and the compound tokens;
  
  computer usable program code for accumulating an occurrence count for at least some of the candidate N-token phrases;
  
  computer usable program code for pruning the candidate N-token phrases by applying a pruning threshold;
  
  computer usable program code for merging overlapping candidate N-token phrases;
  
  computer usable program code for adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
  
  computer usable program code for ordering the candidate N-token phrases according to a score, and selecting the interesting phrases as the highest ranking candidate N-token phrases.
- Dependent claims (17 to 19):
  - 17. The computer program product of claim 16, wherein the corpus is static.
  - 18. The computer program product of claim 17, wherein the score includes an occurrence count of the candidate N-token phrases.
  - 19. The computer program product of claim 16, wherein the corpus is time-variable.
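
For the time-variable case, one way to use occurrence counts from the preceding n intervals is sketched below. The source does not give a scoring formula, so the ratio used here (current count divided by one plus the mean count over the stored intervals) is purely an assumption, as are the class and method names. It favors phrases that are new or rising over phrases that are merely frequent.

```python
from collections import Counter, deque


class TrendTracker:
    """Hypothetical helper that ranks phrases against recent history."""

    def __init__(self, n_intervals=3):
        # Counters for the preceding n intervals; the oldest is dropped first.
        self.history = deque(maxlen=n_intervals)

    def score_interval(self, counts, k):
        """Rank this interval's phrases, then record the interval."""
        n = len(self.history)
        scored = []
        for phrase, c in counts.items():
            # Mean count over the stored intervals (0.0 when none exist yet).
            past = sum(h[phrase] for h in self.history) / n if n else 0.0
            scored.append((phrase, c / (1.0 + past)))
        # Highest score first; ties are broken alphabetically.
        scored.sort(key=lambda ps: (-ps[1], ps[0]))
        self.history.append(Counter(counts))
        return scored[:k]
```

A phrase absent from the history is scored at its raw count, so a newly appearing phrase with a modest count can outrank an established phrase whose count is flat.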

20. A system for automatically extracting a plurality of interesting phrases in a corpus, comprising:
- a tokenizer for generating a plurality of tokens by tokenizing the corpus and expanding abbreviations as directed by a dictionary, a token combiner for combining the tokens into compound tokens as directed by the dictionary;
  
  a token phrase counter for forming candidate N-token phrases from the tokens and the compound tokens, and for accumulating an occurrence count for at least some of the candidate N-token phrases;
  
  a pruner for pruning the candidate N-token phrases by applying a pruning threshold;
  
  a merger for merging overlapping candidate N-token phrases;
  
  a count adjuster for adjusting an occurrence count of each of the candidate N-token phrases to account for any one or more of a sub-phrase, a plural, or a possessive; and
  
  a phrase selector for ordering the candidate N-token phrases according to a score, and for selecting the interesting phrases as the highest ranking candidate N-token phrases.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Zhang, Zengyan; Novak, Jasmine; Niblack, Carlton; Kurita, Keiko; Kaku, Vinay
Application Number: US11/234,667
Publication Number: US 20070067157A1
Field of Search / US Class Current: 704/10
CPC Class Codes: G06F 40/289 Phrasal analysis, e.g. fini...
