Method and platform for term extraction from large collection of documents

US 7,395,256 B2
Filed: 06/20/2003
Issued: 07/01/2008
Est. Priority Date: 06/20/2003
Status: Expired due to Fees

Abstract

A method and platform for statistically extracting terms from large sets of documents are described. An importance vector is determined for each document in the set of documents based on importance values for words in each document. A binary document classification tree is formed by clustering the documents into clusters of similar documents based on the importance vector for each document. An infrastructure is built for the set of documents by generalizing the binary document classification tree. The document clusters are determined by dividing the generalized tree of the infrastructure into two parts and cutting away the upper part. Statistically significant individual key words are extracted from the clusters of similar documents. Key words are treated as seeds and terms are extracted by starting from the seeds and extending to their left or right contexts.
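
The first step of the abstract, building an importance vector per document, can be illustrated with a minimal Python sketch. The record does not reproduce the patent's importance formula, so the weighting below (in-document count times the log-ratio of document frequency to smoothed reference-corpus frequency) is an assumption chosen only to reflect "frequency and saliency with respect to a reference corpus". The names `importance_vector`, `cosine`, `ref_freq`, and `ref_total` are hypothetical.

```python
import math
from collections import Counter


def importance_vector(doc_tokens, ref_freq, ref_total):
    """Map each word of one document to an importance value.

    ASSUMPTION: importance = count * log(p_doc / p_ref). The patent's own
    formula is not reproduced in this record.
    """
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    vec = {}
    for word, c in counts.items():
        p_doc = c / n
        # Add-one smoothing so unseen reference words do not divide by zero.
        p_ref = (ref_freq.get(word, 0) + 1) / (ref_total + 1)
        saliency = math.log(p_doc / p_ref)
        if saliency > 0:  # keep only words over-represented in the document
            vec[word] = c * saliency
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Sparse dictionaries keep the vectors small for large vocabularies; any weighting that rewards words frequent in the document but rare in the reference corpus could be substituted.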

22 Claims

1. A method for statistically extracting terms from a set of documents comprising the steps of:
- determining for each document in the set of documents an importance vector based on importance values for words in each document;
  
  forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
  
  building an infrastructure for the set of documents by generalizing the binary document classification tree;
  
  analyzing distribution of the importance vectors to determine a cutting threshold;
  
  cutting the generalized tree of the infrastructure according to the cutting threshold to further cluster the documents;
  
  extracting statistically significant individual key words from the further clusters of similar documents; and
  
  extracting terms using the key words as seeds, wherein the cutting threshold is defined by S₁*, that satisfies the equation;
- Dependent claims:
  - 2. The method of claim 1, wherein the importance value for words in each document is determined based on the frequency and saliency of the words in the documents with respect to a reference corpus.
  - 3. The method of claim 1, wherein the documents are clustered by performing the steps of:
    - selecting two of the documents having the greatest similarity;
      
      merging the two documents into a new document;
      
      computing an importance vector for the new document; and
      
      again selecting and merging two of the documents or new documents having the biggest similarity.
  - 4. The method of claim 3, wherein the similarity of the documents is determined by the cosine of their importance vectors.
  - 5. The method of claim 1, wherein the generalizing step merges similar nodes together.
  - 6. The method of claim 1, wherein the generalizing step includes the steps of:
    - clustering similarity values of the documents into nodes forming a binary weight classification tree;
      
      dividing the binary weight classification tree to determine clusters of nodes forming the binary weight classification tree; and
      
      merging nodes of the binary document classification tree having similarity values within the clusters of nodes forming the binary weight classification tree.
  - 7. The method of claim 6, wherein the dividing step includes the steps of:
    - analyzing the distribution of similarity values to determine a cutting level; and
      
      dividing the binary weight classification tree in accordance with the cutting level.
  - 8. The method of claim 1, wherein the step of cutting the generalized binary tree is performed by dividing the tree into one part having the same root as the generalized binary tree and another part having a forest of document clusters of the similar documents of the generalized tree.
  - 9. The method of claim 8, wherein words in X*, which satisfies the equation:
  - 10. The method of claim 1, wherein mutual information is used to determine terms based on the seeds.
  - 11. The method of claim 10, wherein a word string t is determined to be a term if it meets requirements including having at least one of the seeds.
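
The clustering loop of claims 3 and 4 (select the two most similar documents by cosine, merge them, recompute the importance vector, repeat) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the merged document's vector is taken to be the element-wise sum of its parts, and the names `build_binary_tree`, `merge_vectors`, and the `left`/`right`/`sim` node keys are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_vectors(u, v):
    """ASSUMPTION: the merged document's vector is the element-wise sum."""
    out = dict(u)
    for k, w in v.items():
        out[k] = out.get(k, 0.0) + w
    return out


def build_binary_tree(vectors):
    """Greedily merge the most similar pair until one root remains.

    `vectors` maps a document name to its importance vector (non-empty).
    Leaves are names; internal nodes are dicts with left, right and sim.
    """
    nodes = list(vectors.items())  # (tree, vector) pairs
    while len(nodes) > 1:
        # Naive O(n^2) scan per merge; fine for a sketch, slow at scale.
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: cosine(nodes[p[0]][1], nodes[p[1]][1]))
        sim = cosine(nodes[i][1], nodes[j][1])
        merged = ({"left": nodes[i][0], "right": nodes[j][0], "sim": sim},
                  merge_vectors(nodes[i][1], nodes[j][1]))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

Each internal node records the similarity at which its two children were merged, which is the value a later generalizing or cutting step would inspect.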

12. A computing platform for statistically extracting terms from a set of documents comprising:
- means for determining for each document in the set of documents an importance vector based on importance values for words in each document;
  
  means for forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
  
  means for building an infrastructure for the set of documents by generalizing the binary document classification tree;
  
  means for analyzing distribution of the importance vectors to determine a cutting threshold;
  
  means for dividing the generalized tree of the infrastructure into two parts according to the cutting threshold to define further document clusters;
  
  means for extracting statistically significant individual key words from the further clusters of similar documents; and
  
  means for extracting terms using the key words as seeds wherein the cutting threshold is defined by S₁*, that satisfies the equation;
- Dependent claims:
  - 13. The platform of claim 12, wherein the importance value for words in each document is determined based on the frequency and saliency of the words in the documents with respect to a reference corpus.
  - 14. The platform of claim 12, wherein the means for clustering the documents includes:
    - means for selecting two of the documents having the biggest similarity;
      
      means for merging the two documents into a new document;
      
      means for computing an importance vector for the new document; and
      
      means for again selecting and merging two of the documents or new documents having the biggest similarity.
  - 15. The platform of claim 14, wherein the similarity of the documents is determined by the cosine of their importance vectors.
  - 16. The platform of claim 12, wherein the generalizing means merges similar nodes together.
  - 17. The platform of claim 12, wherein the generalizing means includes:
    - means for clustering similarity values of the documents into nodes forming a binary weight classification tree;
      
      means for dividing the binary weight classification tree to determine clusters of nodes forming the binary weight classification tree; and
      
      means for merging nodes of the binary document classification tree having similarity values within the clusters of nodes forming the binary weight classification tree.
  - 18. The platform of claim 12, wherein the step of dividing the generalized tree is performed by dividing the tree into one part having the same root as the generalized binary tree and another part having a forest of document clusters of the similar documents of the generalized tree.
  - 19. The method of claim 18, wherein words in X*, which satisfies the equation:
  - 20. The method of claim 12, wherein mutual information is used to determine terms based on the seeds.
  - 21. The method of claim 20, wherein a word string t is determined to be a term if it meets requirements including having at least one of the seeds.
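
Claims 10, 11, 20 and 21 state that mutual information is used to grow terms from seed key words, and the abstract describes extending seeds into their left or right contexts. The sketch below shows one way this could look, assuming pointwise mutual information between adjacent words and a fixed acceptance threshold. The function name `extract_terms` and the parameter `min_pmi` are hypothetical; the patent's actual term requirements are not reproduced in this record.

```python
import math
from collections import Counter


def extract_terms(sentences, seeds, min_pmi):
    """Grow multi-word terms outward from seed words.

    `sentences` is a list of token lists. A neighbour is absorbed while the
    pointwise mutual information of the adjacent pair is at least `min_pmi`
    (ASSUMPTION: the threshold rule is illustrative only).
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values()) or 1

    def pmi(a, b):
        joint = bigrams[(a, b)] / n_bi
        if joint == 0:
            return float("-inf")
        return math.log(joint / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    terms = set()
    for s in sentences:
        for i, word in enumerate(s):
            if word not in seeds:
                continue
            lo = hi = i
            while lo > 0 and pmi(s[lo - 1], s[lo]) >= min_pmi:
                lo -= 1  # extend into the left context
            while hi < len(s) - 1 and pmi(s[hi], s[hi + 1]) >= min_pmi:
                hi += 1  # extend into the right context
            if hi > lo:  # keep only multi-word strings containing the seed
                terms.add(" ".join(s[lo:hi + 1]))
    return terms
```

Every returned string contains at least one seed by construction, which mirrors the requirement named in claims 11 and 21. On very small corpora nearly every adjacent pair scores a high PMI, so the threshold needs tuning against the collection size.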

22. A method for statistically extracting terms from a set of documents comprising the steps of:
- determining for each document in the set of documents an importance vector based on importance values for words in each document;
  
  forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
  
  building an infrastructure for the set of documents by generalizing the binary document classification tree;
  
  cutting the generalized tree of the infrastructure to further cluster the documents;
  
  extracting statistically significant individual key words from the further clusters of similar documents; and
  
  extracting terms using the key words as seeds;
  
  wherein the generalizing step includes the steps of:
  
  clustering similarity values of the documents into nodes forming a binary weight classification tree;
  
  analyzing the distribution of similarity values to determine a cutting level;
  
  dividing the binary weight classification tree in accordance with the cutting level to determine clusters of nodes forming the binary weight classification tree; and
  
  merging nodes of the binary document classification tree having similarity values within the clusters of nodes forming the binary weight classification tree, wherein the cutting level is defined by S₁* that satisfies the equation;
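
The generalizing step of claim 22 clusters the similarity values themselves and then merges tree nodes whose values fall in the same cluster. The sketch below is a rough illustration only. It replaces the binary weight classification tree and the S₁* cutting level (whose equation is not reproduced in this record) with a simpler stand-in: sorted similarity values are split wherever the jump between neighbours exceeds a fixed `gap`, which corresponds to cutting a single-linkage tree at a fixed level. The names `cluster_values`, `collect_sims`, `generalize`, and the `sim`/`children` node keys are hypothetical.

```python
def cluster_values(values, gap):
    """Label 1-D values; a new cluster starts when a jump exceeds `gap`.

    ASSUMPTION: a fixed gap stands in for the patent's cutting level.
    """
    ordered = sorted(set(values))
    labels, label = {}, 0
    for prev, cur in zip([None] + ordered, ordered):
        if prev is not None and cur - prev > gap:
            label += 1
        labels[cur] = label
    return labels


def collect_sims(node):
    """Gather the similarity values of all internal nodes (leaves are str)."""
    if isinstance(node, str):
        return []
    sims = [node["sim"]]
    for child in node["children"]:
        sims += collect_sims(child)
    return sims


def generalize(node, labels):
    """Absorb an internal child into its parent when both similarity
    values carry the same cluster label, yielding an n-ary tree."""
    if isinstance(node, str):
        return node
    children = []
    for child in node["children"]:
        child = generalize(child, labels)
        if not isinstance(child, str) and labels[child["sim"]] == labels[node["sim"]]:
            children.extend(child["children"])
        else:
            children.append(child)
    return {"sim": node["sim"], "children": children}
```

A binary tree from the clustering step can be passed in once each internal node is expressed as a `children` list of two entries; nodes merged at nearly the same similarity then collapse into a single, wider node.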

Current Assignee: Agency For Science Technology and Research (Government of Singapore)
Original Assignee: Agency For Science Technology and Research (Government of Singapore)
Inventors: Yang, Lingpeng; Nie, Yu; Ji, Donghong
Primary Examiner(s): Mofiz, Apu
Assistant Examiner(s): Padmanabhan, Kavita
Application Number: US10/465,567
Publication Number: US 20040267709A1
Time in Patent Office: 1,838 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/355   Class or cluster creation o...
Y10S 707/966   Distributed
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99942   Manipulating data structure...