Method and platform for term extraction from large collection of documents
First Claim
1. A method for statistically extracting terms from a set of documents comprising the steps of:
determining for each document in the set of documents an importance vector based on importance values for words in each document;
forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
building an infrastructure for the set of documents by generalizing the binary document classification tree;
analyzing distribution of the importance vectors to determine a cutting threshold;
cutting the generalized tree of the infrastructure according to the cutting threshold to further cluster the documents;
extracting statistically significant individual key words from the further clusters of similar documents; and
extracting terms using the key words as seeds, wherein the cutting threshold is defined by S1*, that satisfies the equation;
Abstract
A method and platform for statistically extracting terms from large sets of documents are described. An importance vector is determined for each document in the set of documents based on importance values for words in each document. A binary document classification tree is formed by clustering the documents into clusters of similar documents based on the importance vector for each document. An infrastructure is built for the set of documents by generalizing the binary document classification tree. The document clusters are determined by dividing the generalized tree of the infrastructure into two parts and cutting away the upper part. Statistically significant individual key words are extracted from the clusters of similar documents. The key words are treated as seeds, and terms are extracted by starting from the seeds and extending to their left or right contexts.
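The clustering stages described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the importance values are assumed here to be smoothed TF-IDF weights, the similarity measure is assumed to be cosine similarity between summed cluster vectors, and the cutting threshold is passed in by hand rather than derived from the distribution analysis that the claims define through S1*. The function names (importance_vectors, build_binary_tree, cut_tree) are hypothetical, and the generalizing step is omitted.

```python
import math
from collections import Counter


def importance_vectors(docs):
    """Map each tokenized document to a {word: weight} vector.

    Assumption: smoothed TF-IDF stands in for the 'importance values'.
    """
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            w: (c / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_binary_tree(vectors):
    """Merge the two most similar clusters repeatedly until one root remains."""
    nodes = [{"docs": [i], "vec": v, "sim": 1.0, "left": None, "right": None}
             for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        best, bi, bj = -1.0, 0, 1
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                s = cosine(nodes[i]["vec"], nodes[j]["vec"])
                if s > best:
                    best, bi, bj = s, i, j
        left, right = nodes[bi], nodes[bj]
        merged = dict(left["vec"])
        for w, v in right["vec"].items():
            merged[w] = merged.get(w, 0.0) + v
        parent = {"docs": left["docs"] + right["docs"], "vec": merged,
                  "sim": best, "left": left, "right": right}
        nodes = [nd for k, nd in enumerate(nodes) if k not in (bi, bj)]
        nodes.append(parent)
    return nodes[0]


def cut_tree(node, threshold):
    """Drop the upper part of the tree: keep subtrees merged at sim >= threshold."""
    if node["left"] is None or node["sim"] >= threshold:
        return [sorted(node["docs"])]
    return cut_tree(node["left"], threshold) + cut_tree(node["right"], threshold)


if __name__ == "__main__":
    corpus = [s.split() for s in [
        "apple banana apple",
        "banana apple fruit",
        "engine car wheel",
        "car wheel road",
    ]]
    tree = build_binary_tree(importance_vectors(corpus))
    print(cut_tree(tree, threshold=0.1))
```

Each internal node records the similarity at which its two children were merged, so cutting amounts to descending from the root until that recorded similarity reaches the threshold. The quadratic pair search is kept for readability only and would not scale to a large collection.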
22 Claims
1. A method for statistically extracting terms from a set of documents comprising the steps of:
determining for each document in the set of documents an importance vector based on importance values for words in each document;
forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
building an infrastructure for the set of documents by generalizing the binary document classification tree;
analyzing distribution of the importance vectors to determine a cutting threshold;
cutting the generalized tree of the infrastructure according to the cutting threshold to further cluster the documents;
extracting statistically significant individual key words from the further clusters of similar documents; and
extracting terms using the key words as seeds, wherein the cutting threshold is defined by S1*, that satisfies the equation;
12. A computing platform for statistically extracting terms from a set of documents comprising:
means for determining for each document in the set of documents an importance vector based on importance values for words in each document;
means for forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
means for building an infrastructure for the set of documents by generalizing the binary document classification tree;
means for analyzing distribution of the importance vectors to determine a cutting threshold;
means for dividing the generalized tree of the infrastructure into two parts according to the cutting threshold to define further document clusters;
means for extracting statistically significant individual key words from the further clusters of similar documents; and
means for extracting terms using the key words as seeds, wherein the cutting threshold is defined by S1*, that satisfies the equation;
22. A method for statistically extracting terms from a set of documents comprising the steps of:
determining for each document in the set of documents an importance vector based on importance values for words in each document;
forming a binary document classification tree by clustering the documents into clusters of similar documents based on the importance vector for each of the documents;
building an infrastructure for the set of documents by generalizing the binary document classification tree;
cutting the generalized tree of the infrastructure to further cluster the documents;
extracting statistically significant individual key words from the further clusters of similar documents; and
extracting terms using the key words as seeds;
wherein the generalizing step includes the steps of:
clustering similarity values of the documents into nodes forming a binary weight classification tree;
analyzing the distribution of similarity values to determine a cutting level;
dividing the binary weight classification tree in accordance with the cutting level to determine clusters of nodes forming the binary weight classification tree; and
merging nodes of the binary document classification tree having similarity values within the clusters of nodes forming the binary weight classification tree, wherein the cutting level is defined by S1* that satisfies the equation;
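The final step shared by the independent claims, extracting terms using the key words as seeds, can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed procedure: the listing does not say how an extension is judged statistically significant, so the sketch simply accepts the most frequent one-word extension to the left or right as long as it occurs at least min_count times. The function extend_seed and the parameters min_count and max_len are hypothetical.

```python
from collections import Counter


def extend_seed(docs, seed, min_count=2, max_len=4):
    """Grow a multi-word term outward from a seed key word.

    Assumption: an extension is accepted when the extended phrase occurs
    at least `min_count` times in the tokenized documents.
    """
    term = (seed,)
    while len(term) < max_len:
        k = len(term)
        candidates = Counter()
        for doc in docs:
            for i in range(len(doc) - k + 1):
                if tuple(doc[i:i + k]) != term:
                    continue
                if i > 0:  # extend into the left context
                    candidates[(doc[i - 1],) + term] += 1
                if i + k < len(doc):  # extend into the right context
                    candidates[term + (doc[i + k],)] += 1
        if not candidates:
            break
        best, freq = candidates.most_common(1)[0]
        if freq < min_count:
            break
        term = best
    return " ".join(term)


if __name__ == "__main__":
    corpus = [s.split() for s in [
        "the support vector machine works",
        "a support vector machine model",
        "support vector machine training",
    ]]
    print(extend_seed(corpus, "vector"))
```

The greedy loop stops once no neighboring word recurs often enough, which is what keeps one-off context words out of the extracted term. A frequency floor is the simplest possible stand-in; a real system would likely use a proper association or significance measure.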
Specification