Test classification system and method
Abstract
Documents are classified into one or more clusters corresponding to predefined classification categories by building a knowledge base comprising matrices of vectors which indicate the significance of terms within a corpus of text formed by the documents and classified in the knowledge base to each cluster. The significance of terms is determined assuming a standard normal probability distribution, and terms are determined to be significant to a cluster if their probability of occurrence being due to chance is low. For each cluster, statistical signatures comprising sums of weighted products and intersections of cluster terms to corpus terms are generated and used as discriminators for classifying documents. The knowledge base is built using prefix and suffix lexical rules which are context-sensitive and applied selectively to improve the accuracy and precision of classification.
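The significance test described in the abstract can be illustrated with a short sketch. The abstract does not give the test statistic, so the two-proportion z-score below, the one-sided cutoff `z_crit`, and the function name `term_weight` are all assumptions chosen for illustration; only the ideas of a standard normal assumption and a zero weight for non-significant terms come from the text.

```python
import math

def term_weight(in_count, in_total, out_count, out_total, z_crit=1.96):
    """Weight of a term for one cluster (illustrative assumption, not the patent's formula).

    Compares the term's rate inside the cluster with its rate outside the
    cluster using a pooled two-proportion z-score. A term whose z-score does
    not exceed z_crit is treated as a chance occurrence and gets weight 0.
    """
    if in_total == 0 or out_total == 0:
        return 0.0
    p_in = in_count / in_total
    p_out = out_count / out_total
    p_pool = (in_count + out_count) / (in_total + out_total)
    variance = p_pool * (1 - p_pool) * (1 / in_total + 1 / out_total)
    if variance == 0:
        return 0.0
    z = (p_in - p_out) / math.sqrt(variance)
    return z if z > z_crit else 0.0

# Hypothetical counts: a term seen 50 times in 1,000 cluster tokens
# but only 10 times in the 9,000 tokens outside the cluster.
frequent_in_cluster = term_weight(50, 1000, 10, 9000)
# A term with the same rate inside and outside the cluster.
evenly_spread = term_weight(10, 1000, 90, 9000)
```

Under these assumptions the first term receives a positive weight and the second a weight of zero, which mirrors the abstract's rule that a term is significant to a cluster only when its occurrence is unlikely to be due to chance.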
11 Claims
1. A method of automatically classifying a text entity which comprises a plurality of terms into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of text entities related to a particular corresponding subject area, the method comprising:
forming a list of terms sorted by order of occurrence from the corpus;
determining, for each of the clusters, a value of statistical weight of significance of terms of the list in said each cluster by examining distributions of the terms inside of the cluster and outside of the cluster, said determining comprising calculating a weight of significance of terms in said each cluster, and assigning a weight of zero to terms which are not statistically significant in said each cluster;
constructing a vector for each cluster, the vector having element values corresponding to the weights of significance of the terms in the cluster;
calculating for each cluster from its corresponding vector statistical signatures of the cluster;
determining from the statistical signatures a score for the text entity for each cluster indicating the relevance of the text entity to the cluster; and
classifying the text entity into one or more clusters based upon said scores.
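The steps of claim 1 after the weights are known (vector, signature, score, classification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: "order of occurrence" is read here as descending frequency, the signature is assumed to be the Euclidean norm of the cluster vector, the score and the `threshold` are hypothetical, and the per-cluster weights are made-up inputs rather than computed values.

```python
from collections import Counter
import math

def term_list(corpus_docs):
    """Corpus terms sorted by descending frequency (one reading of 'order of occurrence')."""
    counts = Counter(t for doc in corpus_docs for t in doc)
    return [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def cluster_vector(terms, weights):
    """One element per corpus term; terms without a significant weight get 0."""
    return [weights.get(t, 0.0) for t in terms]

def signature(vector):
    """Assumed signature: Euclidean norm of the weight vector."""
    return math.sqrt(sum(w * w for w in vector))

def score(doc, terms, vector, sig):
    """Sum of the weights of cluster terms present in the document, scaled by the signature."""
    if sig == 0:
        return 0.0
    present = set(doc)
    return sum(w for t, w in zip(terms, vector) if t in present) / sig

def classify(doc, terms, vectors, threshold=0.5):
    """Return the clusters whose score reaches the (hypothetical) threshold, plus all scores."""
    scores = {c: score(doc, terms, v, signature(v)) for c, v in vectors.items()}
    return sorted(c for c, s in scores.items() if s >= threshold), scores

# Hypothetical corpus and precomputed significance weights (zero weights omitted).
corpus = [["bond", "yield", "rate"], ["rate", "bank"],
          ["goal", "match", "team"], ["team", "coach"]]
weights = {"finance": {"bond": 3.0, "rate": 4.0},
           "sports": {"team": 4.0, "goal": 3.0}}

terms = term_list(corpus)
vectors = {c: cluster_vector(terms, w) for c, w in weights.items()}
labels, scores = classify(["bond", "rate", "rises"], terms, vectors)
```

Because the text entity may be placed in "one or more clusters", the sketch returns every cluster at or above the threshold rather than only the best-scoring one.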
9. A method of automatically classifying a document which comprises a plurality of words and phrases into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of documents related to a particular corresponding subject area, the method comprising:
calculating, for each of the clusters, values of a statistical weight of significance of distributions of the words and phrases in the cluster and in a complement of the cluster, and assigning a value of zero to the weights of words and phrases which are not statistically significant in the cluster;
calculating using the values of the weights of significance of the words and phrases in each cluster statistical signatures of the cluster, said statistical signatures comprising sums of weighted products and intersections of words and phrases in the cluster;
determining from the statistical signatures cluster scores for the document representing the relevance of the document to each cluster; and
classifying the document into one or more clusters based upon said scores.
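Claim 9 names two kinds of signature, sums of weighted products and intersections, without giving formulas. The sketch below is one possible reading, offered only as an assumption: the weighted product sum multiplies each significant weight by a term count, the intersection is the number of significant cluster terms that occur, and the document score averages the document's two values taken as fractions of the cluster's own values over the corpus. The names `signatures` and `cluster_score` and all data are hypothetical.

```python
from collections import Counter

def signatures(weights, counts):
    """Return (sum of weight * count products, size of the term intersection).

    Only terms with a non-zero weight of significance take part.
    """
    shared = {t for t in counts if weights.get(t, 0.0) != 0.0}
    product_sum = sum(weights[t] * counts[t] for t in shared)
    return product_sum, len(shared)

def cluster_score(weights, corpus_counts, doc_terms):
    """Assumed score: document signatures as fractions of the cluster's signatures, averaged."""
    c_sum, c_hits = signatures(weights, corpus_counts)
    d_sum, d_hits = signatures(weights, Counter(doc_terms))
    if c_sum == 0 or c_hits == 0:
        return 0.0
    return 0.5 * (d_sum / c_sum + d_hits / c_hits)

# Hypothetical weights for one cluster ("the" is not significant, so its weight is 0)
# and hypothetical term counts for the whole corpus.
weights = {"bond": 3.0, "rate": 4.0, "the": 0.0}
corpus_counts = Counter({"bond": 10, "rate": 5, "the": 100, "team": 7})

cluster_sig = signatures(weights, corpus_counts)
doc_score = cluster_score(weights, corpus_counts, ["bond", "rate", "rate", "the"])
```

Zero-weighted words such as "the" contribute to neither signature, which is the practical effect of assigning zero to words and phrases that are not statistically significant in the cluster.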
11. A system for automatically classifying a text entity which comprises a plurality of terms into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of text entities related to a particular corresponding subject area, the system comprising a classifier having:
means for determining for a selected term in a text entity to be classified and for each cluster a probability distribution of the selected term in the cluster and in a complement of the cluster;
means for assigning a weight of zero to terms which are not statistically significant in the cluster;
means for calculating a statistical score for the cluster from the non-zero weights of significance of terms in the cluster; and
means for classifying the text entity into one or more clusters based upon said score.
Specification