Method for performing effective drill-down operations in text corpus visualization and exploration using language model approaches for key phrase weighting

US 20080243482A1
Filed: 05/04/2007
Published: 10/02/2008
Est. Priority Date: 03/28/2007
Status: Abandoned Application

Abstract

The invention relates to a method and an apparatus for performing a drill-down operation on a text corpus comprising documents, using language models for key phrase weighting, said method comprising the steps of weighting key phrases occurring both in a foreground language model, which contains a selected document cluster of said text corpus, and in a background language model, which does not contain said selected document cluster, by calculating for each key phrase a key phrase weight comprising a ratio between the foreground weight of said key phrase and a background weight of said key phrase, and assigning documents of the foreground language model to cluster labels which are formed by key phrases having high calculated key phrase weights.
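
The ratio weighting described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the source: each document is represented as a list of already-extracted key phrases, and the foreground and background weights are plain relative frequencies. The function names `model_weights` and `ratio_weights` are hypothetical.

```python
from collections import Counter


def model_weights(docs):
    """Relative frequency of each key phrase across a list of documents.

    Assumption: every document is a list of already-extracted key phrases.
    """
    counts = Counter(phrase for doc in docs for phrase in doc)
    total = sum(counts.values())
    return {phrase: n / total for phrase, n in counts.items()}


def ratio_weights(foreground_docs, background_docs):
    """Weight phrases that occur in BOTH models by w_fg(k) / w_bg(k)."""
    w_fg = model_weights(foreground_docs)
    w_bg = model_weights(background_docs)
    shared = w_fg.keys() & w_bg.keys()
    return {k: w_fg[k] / w_bg[k] for k in shared}


# Foreground: the selected document cluster. Background: the other documents.
foreground = [["wind turbine", "power grid"], ["wind turbine", "rotor blade"]]
background = [["power grid", "wind turbine"],
              ["power grid", "gas turbine"],
              ["power grid", "steam turbine"]]

weights = ratio_weights(foreground, background)
for phrase, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase}: {weight:.2f}")
```

In this toy data, "wind turbine" is relatively more frequent in the foreground than in the background, so it receives the higher weight; "rotor blade" is skipped because it does not occur in the background model.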

15 Citations

24 Claims

1. A method for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, said method comprising the steps of:
- (a) weighting key phrases occurring both in a foreground language model, which contains a selected document cluster of said text corpus, and in a background language model, which does not contain said selected document cluster, by calculating for each key phrase a key phrase weight comprising a ratio between the foreground weight of said key phrase and a background weight of said key phrase; and
  
  (b) assigning documents of the foreground language model to cluster labels which are formed by key phrases having high calculated key phrase weights.
- - 2. The method according to claim 1, wherein the foreground weight of said key phrase in the documents of the foreground language model, which contains said selected document cluster, and the background weight of said key phrase in the documents of the background language model, which does not contain said selected document cluster, are both calculated according to a predetermined weighting scheme.
  - 3. The method according to claim 2, wherein the weighting scheme comprises a TF/IDF weighting scheme, an informativeness/phraseness measurement weighting scheme, a binomial log-likelihood ratio test weighting scheme (BLRT), a CHI Square-weighting scheme, a Student's t-test weighting scheme or a Kullback-Leibler divergence weighting scheme.
  - 4. The method according to claim 3, wherein the foreground weight of the key phrase is calculated by using a TF/IDF weighting scheme depending on a term frequency (TF) and an inverse document frequency (IDF) of said key phrase in the documents of the foreground language model which contains said selected document cluster.
  - 5. The method according to claim 3, wherein the background weight of the key phrase is calculated by using a TF/IDF weighting scheme depending on a term frequency (TF) and an inverse document frequency (IDF) of said key phrase in the document of the background language model which does not contain said selected document cluster.
  - 6. The method according to claim 1, wherein the key phrase weight w(k) is calculated by:
    - $w (k) = [\frac{w_{fg} (k)}{w_{bg} (k)}] \cdot \log [w_{fg} (k) + w_{bg} (k)],$ wherein w_fg is a foreground weight of said key phrase (k) and wherein w_bg is the background weight of said key phrase (k).
  - 7. The method according to claim 1, wherein the key phrase weight w(k) is calculated by:
    - $w (k) = \log [\frac{w_{fg} (k)}{w_{bg} (k)}] \cdot \log [w_{fg} (k) + w_{bg} (k)],$ wherein w_fg is the foreground weight of said key phrase (k) and wherein w_bg is the background weight of said key phrase (k).
  - 8. The method according to claim 1, wherein the key phrase weight w(k) is calculated by:
    - $w (k) = \frac{w_{fg} (k)}{w_{bg} (k)},$ wherein w_fg is the foreground weight of said key phrase (k) and wherein w_bg is the background weight of said key phrase (k).
  - 9. The method according to claim 1, wherein the key phrase weight w(k) is calculated by:
    - $w (k) = \log [\frac{w_{fg} (k)}{w_{bg} (k)}],$ wherein w_fg is the foreground weight of said key phrase (k) and wherein w_bg is the background weight of said key phrase (k).
  - 10. The method according to claim 1, wherein the text corpus is a monolingual text corpus or a multilingual text corpus.
  - 11. The method according to claim 2, wherein said weighting scheme for calculation of said foreground weight and of said background weight of a key phrase in a document weights also said key phrase depending on whether it is a meta tag, a key phrase within a title of said document, a key phrase within an abstract of said document or a key phrase in a text of said document.
  - 12. The method according to claim 1, wherein the document is an HTML-document.
  - 13. The method according to claim 1, wherein the cluster labels of the document clusters are displayed for selection of the corresponding document clusters on a screen.
  - 14. The method according to claim 13, wherein the selection of the corresponding document cluster is performed by a user.
  - 15. The method according to claim 13, wherein the documents of the selected document cluster are displayed to the user on said screen.
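
The four weight combinations recited in claims 6 to 9 can be compared side by side in a short Python sketch. The claims do not state the base of the logarithm, so the natural logarithm is an assumption here, as are the hypothetical function names and the check that both input weights are positive.

```python
import math


def _check(w_fg, w_bg):
    # Assumption: both weights must be positive for the ratio and logs to exist.
    if w_fg <= 0 or w_bg <= 0:
        raise ValueError("foreground and background weights must be positive")


def weight_claim_6(w_fg, w_bg):
    """[w_fg / w_bg] * log[w_fg + w_bg]"""
    _check(w_fg, w_bg)
    return (w_fg / w_bg) * math.log(w_fg + w_bg)


def weight_claim_7(w_fg, w_bg):
    """log[w_fg / w_bg] * log[w_fg + w_bg]"""
    _check(w_fg, w_bg)
    return math.log(w_fg / w_bg) * math.log(w_fg + w_bg)


def weight_claim_8(w_fg, w_bg):
    """w_fg / w_bg"""
    _check(w_fg, w_bg)
    return w_fg / w_bg


def weight_claim_9(w_fg, w_bg):
    """log[w_fg / w_bg]"""
    _check(w_fg, w_bg)
    return math.log(w_fg / w_bg)


# Illustrative weights only; they are not taken from the source.
w_fg, w_bg = 6.0, 2.0
for fn in (weight_claim_6, weight_claim_7, weight_claim_8, weight_claim_9):
    print(fn.__name__, round(fn(w_fg, w_bg), 4))
```

The variants of claims 6 and 7 multiply the ratio (or its logarithm) by the logarithm of the summed weights, so a phrase with large absolute weights scores higher than a rare phrase with the same ratio. Note that this factor is negative whenever the two weights sum to less than 1.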

16. A method for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting comprising the steps of:
- (a) clustering said text corpus into clusters each including a set of documents;
  
  (b) selecting a cluster from among the clusters to generate a foreground language model containing the selected document cluster and a background language model which does not contain the selected document cluster;
  
  (c) weighting key phrases occurring both in the foreground language model and in the background language model by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase;
  
  (d) sorting the weighted key phrases according to the respective key phrase weight in descending order;
  
  (e) weighting a configurable number of key phrases having a high key phrase weight as cluster label; and
  
  (f) assigning documents of a foreground language model to the selected cluster labels.
- - 17. The method according to claim 16, wherein the selected cluster labels are displayed on a screen for selection of subclusters.
  - 18. The method according to claim 17, wherein the selection of the subclusters is performed by a user.
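
Steps (d) to (f) of claim 16 can be sketched in Python as follows. The claim does not say how a document is matched to a label, so assigning a document to every label whose key phrase it contains is an assumption, as are the hypothetical function names, the sample weights, and the sample documents.

```python
def select_cluster_labels(weights, num_labels):
    """Steps (d) and (e): sort by weight, descending, and keep the top phrases."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:num_labels]]


def assign_documents(foreground_docs, labels):
    """Step (f): map each label to the foreground documents containing it.

    Assumption: a document belongs to a label if it contains that key phrase,
    so one document may appear under several labels.
    """
    assignment = {label: [] for label in labels}
    for doc_id, phrases in foreground_docs.items():
        for label in labels:
            if label in phrases:
                assignment[label].append(doc_id)
    return assignment


# Key phrase weights as produced by step (c); the values are illustrative only.
weights = {"rotor blade": 4.2, "wind turbine": 3.0, "power grid": 0.5}

# Foreground documents, each reduced to its set of key phrases.
docs = {
    "d1": {"wind turbine", "power grid"},
    "d2": {"wind turbine", "rotor blade"},
    "d3": {"power grid"},
}

labels = select_cluster_labels(weights, num_labels=2)
subclusters = assign_documents(docs, labels)
print(labels)
print(subclusters)
```

Here `num_labels` plays the role of the configurable number of key phrases in step (e). Document `d3` ends up in no subcluster because it contains neither selected label, which is a consequence of the assumed matching rule rather than something the claim prescribes.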

19. A user terminal for performing a drill-down operation on a text corpus comprising documents stored in at least one data base using language models for key phrase weighting, said user terminal comprising:
- (a) a screen for displaying cluster labels of selectable document clusters each including a set of documents;
  
  (b) a calculation unit for weighting key phrases occurring both in a foreground language model, which contains a selected document cluster of said text corpus, and in a background language model, which does not contain said selected document cluster, by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase and for assigning documents of said foreground language model to cluster labels which are formed by key phrases having high calculated key phrase weights.
- - 20. The user terminal according to claim 19, wherein the user terminal is connected via a network to said data base.
  - 21. The user terminal according to claim 20, wherein the network is a local network.
  - 22. The user terminal according to claim 20, wherein the network is formed by the Internet.

23. An apparatus for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, said apparatus comprising:
- (a) means for weighting a key phrase occurring both in a foreground language model, which contains a selected document cluster of said text corpus, and in a background language model, which does not contain a selected document cluster, by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase; and
  
  (b) means for assigning documents of the foreground language model to cluster labels which are formed by key phrases having high calculated key phrase weights.

24. An apparatus for performing a drill-down operation on a text corpus comprising documents using language models for key phrase weighting, wherein said apparatus comprises:
- (a) means for clustering said text corpus into clusters each including a set of documents;
  
  (b) means for selecting a cluster from among the clusters to generate a foreground language model which contains the selected document cluster and a background language model which does not contain the selected document cluster;
  
  (c) means for weighting key phrases occurring both in the foreground language model and in the background language model by calculating for each key phrase a key phrase weight comprising a ratio between a foreground weight of said key phrase and a background weight of said key phrase;
  
  (d) means for sorting the weighted key phrases according to the key phrase weight;
  
  (e) means for selecting a configurable number of key phrases having the highest key phrase weight as cluster labels; and
  
  (f) means for assigning documents of the foreground language model to the selected cluster labels.

Current Assignee: Siemens AG
Original Assignee: Siemens AG
Inventors: Skubacz, Michal; Ziegler, Cai-Nicolas
Application Number: US11/797,632
Publication Number: US 20080243482A1
US Class Current: 704/9
CPC Class Codes: G06F 40/30 Semantic analysis
