Modeling topics using statistical distributions

US 9,317,593 B2
Filed: 10/01/2008
Issued: 04/19/2016
Est. Priority Date: 10/05/2007
Status: Active Grant

Abstract

In one embodiment, modeling topics includes accessing a corpus comprising documents that include words. Words of a document are selected as keywords of the document. The documents are clustered according to the keywords to yield clusters, where each cluster corresponds to a topic. A statistical distribution is generated for a cluster from words of the documents of the cluster. A topic is modeled using the statistical distribution generated for the cluster corresponding to the topic.
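The first two steps of the abstract (keyword selection and keyword-based clustering) can be illustrated with a minimal sketch. The record does not name a ranking technique or a clustering algorithm, so the TF-IDF ranking and the "group by top keyword" rule below are assumptions chosen for brevity, and the function names `select_keywords` and `cluster_by_keyword` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def select_keywords(docs, k=2):
    """Rank each document's words by TF-IDF (an assumed ranking technique)
    and keep the k highest-ranked words as that document's keywords."""
    n = len(docs)
    # Number of documents in which each word appears.
    df = Counter(w for doc in docs for w in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(tf, key=lambda w: (-tf[w] * math.log(n / df[w]), w))
        keywords.append(ranked[:k])
    return keywords

def cluster_by_keyword(keywords):
    """Group document indices by top-ranked keyword. This is a deliberately
    crude stand-in for whatever clustering method is actually used."""
    clusters = defaultdict(list)
    for i, kws in enumerate(keywords):
        if kws:
            clusters[kws[0]].append(i)
    return dict(clusters)

# Toy corpus: each document is a list of words.
docs = [
    ["cat", "cat", "cat", "pet", "the"],
    ["cat", "cat", "cat", "fur", "the"],
    ["stock", "stock", "stock", "market", "the"],
    ["stock", "stock", "stock", "bond", "the"],
]
keywords = select_keywords(docs)
clusters = cluster_by_keyword(keywords)
print(clusters)  # {'cat': [0, 1], 'stock': [2, 3]}
```

Each resulting cluster stands for one topic; the word "the" never becomes a keyword here because it appears in every document and so scores zero under TF-IDF.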

12 Claims

1. A computer-implemented method comprising:
- accessing a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
  
  selecting one or more words of each document as one or more keywords of the each document;
  
  clustering the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a different topic;
  
  generating a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions, wherein generating the statistical distribution for the each cluster comprises:
  
  determining a co-occurrence value indicating a co-occurrence of the topic of the each cluster with the topics of the other clusters in the plurality of documents; and
  
  generating a co-occurrence distribution from the co-occurrence values;
  
  modeling each topic using the statistical distribution generated for the cluster corresponding to the each topic;
  
  organizing the clusters according to the statistical distributions; and
  
  assigning the topics of the organized clusters to the documents in the organized clusters.
- Dependent Claims (2, 3, 4, 5, 6)
- - 2. The method of claim 1, the selecting the one or more words of the each document further comprising:
    - ranking the words of the each document according to a ranking technique; and
      
      selecting one or more highly ranked words as the one or more keywords.
  - 3. The method of claim 1, the clustering the documents according to the keywords to yield the plurality of clusters further comprising:
    - removing one or more clusters that fail to satisfy a size threshold.
  - 4. The method of claim 1, the generating the statistical distribution for the each cluster further comprising:
    - calculating a term frequency of each word of the subset of the words to yield a plurality of term frequencies; and
      
      generating a term distribution from the term frequencies.
  - 5. The method of claim 1, the generating the statistical distribution for the each cluster further comprising:
    - calculating a number of documents that include each word of the subset of the words; and
      
      generating a term distribution from the numbers of documents.
  - 6. The method of claim 1, further comprising:
    - identifying at least two clusters with similar statistical distributions; and
      
      consolidating the at least two clusters.
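Claims 4 through 6 can be illustrated with a short sketch, offered as one possible reading rather than the claimed implementation. It assumes that a "term distribution" is a set of counts normalized to sum to 1, and that "similar" distributions are those whose cosine similarity meets a threshold; neither choice is stated in the claims, and all function names and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter

def tf_distribution(cluster_docs):
    """In the spirit of claim 4: term frequencies over the cluster,
    normalized to sum to 1 (the normalization is an assumption)."""
    counts = Counter(w for doc in cluster_docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def df_distribution(cluster_docs):
    """In the spirit of claim 5: number of documents containing each word,
    normalized to sum to 1 (the normalization is an assumption)."""
    counts = Counter(w for doc in cluster_docs for w in set(doc))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def consolidate(clusters, threshold=0.8):
    """In the spirit of claim 6: greedily merge clusters whose TF
    distributions are similar. Cosine similarity and the threshold
    value are assumptions, not taken from the claims."""
    merged = []
    for docs in clusters:
        for group in merged:
            if cosine(tf_distribution(group), tf_distribution(docs)) >= threshold:
                group.extend(docs)
                break
        else:
            # No sufficiently similar group found: start a new one (copied,
            # so the caller's lists are never mutated).
            merged.append(list(docs))
    return merged
```

Here each cluster is a list of documents and each document is a list of words. The greedy merge depends on input order; a symmetric measure such as Jensen-Shannon divergence with agglomerative merging would be a reasonable alternative.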

7. One or more non-transitory computer-readable tangible media encoding software operable when executed to:
- access a corpus stored in one or more tangible media, the corpus comprising a plurality of documents, a document comprising a plurality of words;
  
  select one or more words of each document as one or more keywords of the each document;
  
  cluster the documents according to the keywords to yield a plurality of clusters, each cluster corresponding to a different topic;
  
  generate a statistical distribution for each cluster from a subset of the words of the documents of the each cluster to yield a plurality of statistical distributions, wherein generating the statistical distribution for the each cluster comprises:
  
  determining a co-occurrence value indicating a co-occurrence of the topic of the each cluster with the topics of the other clusters in the plurality of documents; and
  
  generating a co-occurrence distribution from the co-occurrence values; and
  
  model each topic using the statistical distribution generated for the cluster corresponding to the each topic;
  
  organize the clusters according to the statistical distributions; and
  
  assign the topics of the organized clusters to the documents in the organized clusters.
- Dependent Claims (8, 9, 10, 11, 12)
- - 8. The computer-readable tangible media of claim 7, further operable to select the one or more words of the each document by:
    - ranking the words of the each document according to a ranking technique; and
      
      selecting one or more highly ranked words as the one or more keywords.
  - 9. The computer-readable tangible media of claim 7, further operable to cluster the documents according to the keywords to yield the plurality of clusters by:
    - removing one or more clusters that fail to satisfy a size threshold.
  - 10. The computer-readable tangible media of claim 7, further operable to generate the statistical distribution for the each cluster by:
    - calculating a term frequency of each word of the subset of the words to yield a plurality of term frequencies; and
      
      generating a term distribution from the term frequencies.
  - 11. The computer-readable tangible media of claim 7, further operable to generate the statistical distribution for the each cluster by:
    - calculating a number of documents that include each word of the subset of the words; and
      
      generating a term distribution from the numbers of documents.
  - 12. The computer-readable tangible media of claim 7, further operable to:
    - identify at least two clusters with similar statistical distributions; and
      
      consolidate the at least two clusters.
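The co-occurrence step in the independent claims and the size threshold in claims 3 and 9 can be sketched as follows. The claims do not define when a topic "occurs" in a document, so this sketch assumes a topic is present whenever the document contains at least one of that topic's signal words, and takes the co-occurrence value of two topics to be the number of documents in which both are present. The names `drop_small_clusters`, `cooccurrence_distributions`, `topic_words`, and `min_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def drop_small_clusters(clusters, min_size=2):
    """Remove clusters with fewer than min_size documents
    (min_size is an assumed form of the size threshold)."""
    return {t: members for t, members in clusters.items() if len(members) >= min_size}

def cooccurrence_distributions(docs, topic_words):
    """For each topic, build a distribution over the other topics.

    topic_words maps a topic label to the set of words assumed to signal it.
    The co-occurrence value of topics a and b is the number of documents in
    which both are present; each topic's values are then normalized to sum
    to 1 (or left at 0.0 if the topic never co-occurs with any other)."""
    pair_counts = Counter()
    for doc in docs:
        words = set(doc)
        present = sorted(t for t, ws in topic_words.items() if words & ws)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    dists = {}
    for t in topic_words:
        row = {u: pair_counts[(t, u)] for u in topic_words if u != t}
        total = sum(row.values())
        dists[t] = {u: (c / total if total else 0.0) for u, c in row.items()}
    return dists
```

A topic whose co-occurrence distribution is concentrated on one other topic is a candidate for being organized next to it, which is one plausible use of the distributions in the "organize the clusters" step.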

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: Marvit, David L.; Jain, Jawahar; Stergiou, Stergios; Gilman, Alex; Adler, B. Thomas; Sidorowich, John J.; Labrou, Yannis
Primary Examiner(s): TRAN, ANHTAI V
Application Number: US12/243,267
Publication Number: US 20090094233A1
Time in Patent Office: 2,757 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
G06F 16/355 Class or cluster creation o...
