Test classification system and method

US 6,137,911 A
Filed: 06/16/1997
Issued: 10/24/2000
Est. Priority Date: 06/16/1997
Status: Expired due to Fees

Abstract

Documents are classified into one or more clusters corresponding to predefined classification categories by building a knowledge base comprising matrices of vectors which indicate the significance of terms within a corpus of text formed by the documents and classified in the knowledge base to each cluster. The significance of terms is determined assuming a standard normal probability distribution, and terms are determined to be significant to a cluster if their probability of occurrence being due to chance is low. For each cluster, statistical signatures comprising sums of weighted products and intersections of cluster terms to corpus terms are generated and used as discriminators for classifying documents. The knowledge base is built using prefix and suffix lexical rules which are context-sensitive and applied selectively to improve the accuracy and precision of classification.
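
The significance test described in the abstract can be illustrated with a short sketch. The abstract does not state the formula, so the code below assumes a normal approximation to the binomial: under the null hypothesis a term occurs in a cluster at its corpus-wide rate, and the weight is the resulting standard normal variable (z-score), set to zero when it is not significant. The function name `significance_weight`, its parameters, and the 1.96 cutoff are illustrative assumptions, not taken from the patent.

```python
import math


def significance_weight(k_cluster, n_cluster, k_corpus, n_corpus, z_cutoff=1.96):
    """Return a z-score weight for a term in a cluster, or 0.0 if not significant.

    k_cluster: occurrences of the term inside the cluster
    n_cluster: total term occurrences inside the cluster
    k_corpus:  occurrences of the term in the whole corpus
    n_corpus:  total term occurrences in the whole corpus

    Assumption: under the null hypothesis the in-cluster count is approximately
    Normal(n*p, n*p*(1-p)), where p is the corpus-wide rate of the term.
    """
    p = k_corpus / n_corpus
    variance = n_cluster * p * (1.0 - p)
    if variance == 0.0:
        # Term never occurs (or always occurs): no evidence either way.
        return 0.0
    z = (k_cluster - n_cluster * p) / math.sqrt(variance)
    # Keep the signed z-score only when chance is an unlikely explanation.
    return z if abs(z) >= z_cutoff else 0.0


# A term seen 30 times in a 1,000-term cluster but only 50 times in a
# 100,000-term corpus is far more frequent than chance would suggest.
strong = significance_weight(30, 1000, 50, 100000)

# A term that occurs exactly at its expected rate gets a weight of zero.
neutral = significance_weight(1, 1000, 100, 100000)
```

Collecting these weights for every term in a cluster gives one vector per cluster; stacking the vectors gives the kind of matrix the abstract refers to as the knowledge base.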

532 Citations

11 Claims

1. A method of automatically classifying a text entity which comprises a plurality of terms into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of text entities related to a particular corresponding subject area, the method comprising:
- forming a list of terms sorted by order of occurrence from the corpus;
- determining, for each of the clusters, a value of statistical weight of significance of terms of the list in said each cluster by examining distributions of the terms inside of the cluster and outside of the cluster, said determining comprising calculating a weight of significance of terms in said each cluster, and assigning a weight of zero to terms which are not statistically significant in said each cluster;
- constructing a vector for each cluster, the vector having element values corresponding to the weights of significance of the terms in the cluster;
- calculating for each cluster from its corresponding vector statistical signatures of the cluster;
- determining from the statistical signatures a score for the text entity for each cluster indicating the relevance of the text entity to the cluster; and
- classifying the text entity into one or more clusters based upon said scores.
  - 2. The method of claim 1, wherein said calculating said weight of significance comprises calculating a standard normal variable of a standard normal probability distribution for each term; and wherein said assigning comprises using as said weights calculated values of the standard normal variable which correspond to a low probability of occurrence of a term in a cluster due to a random distribution, and otherwise assigning to a term a weight of zero.
  - 3. The method of claim 1, wherein said statistical signatures comprise a sum of weighted products corresponding to the products of the number of occurrences of terms in the text entity and the corresponding weights of significance of the terms in the cluster, and an intersection sum corresponding to the number of occurrences in the text entity of terms having weights of significance greater than a predetermined threshold.
  - 4. The method of claim 3, wherein said predetermined threshold is selected to be zero such that the intersection represents terms which have positive significance in said text entity.
  - 5. The method of claim 4 further comprising determining using said intersection sum whether the occurrence of terms in a cluster having positive significance is due to chance by determining the value of a standard normal variable which characterizes the intersection of terms in the cluster, and assigning the text entity to the cluster when the value of said standard normal variable exceeds a predetermined value.
  - 6. The method of claim 1 further comprising:
    - forming a matrix of weights of significance of each term in each cluster;
    - comparing pairs of terms to generate similarity values for each term, the similarity values measuring semantic similarity between terms;
    - selecting for each term a list of champions by rank ordering terms which are semantically close to said term according to similarity value, and selecting as champions a number of terms which exceed a predetermined number.
  - 7. The method of claim 6, wherein said terms comprise words, and the method further comprises generating lexical rules by identifying sequences of characters in a first part of a word which match sequences of characters in words in said list of champions, equating non-matching sequences of characters in words, rank ordering said equated non-matching sequences by frequency of occurrence to generate lexical rule candidates, and selecting as lexical rules a predetermined number of rules from said rank ordering.
  - 8. The method of claim 7 further comprising:
    - applying said lexical rules to words in said corpus to generate lists of inflections, said inflections comprising words having the same meaning;
    - assigning a common identifier to all words in an inflection list which have the same meaning;
    - substituting for each word in said corpus its corresponding identifier;
    - inspecting pairs of identifiers corresponding to adjacent words to generate phrases;
    - determining the weight of significance of the phrases in each cluster; and
    - using said weights of significance of phrases to calculate said statistical signatures of the clusters.
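
The champion selection of claim 6 can be sketched as follows. The claim does not name a similarity measure, so this example assumes cosine similarity between the rows of the term-by-cluster weight matrix, and it reads the selection step as keeping the top `top_n` most similar terms. The names `cosine`, `champions`, and `top_n`, and the sample weights, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def champions(weights, top_n=2):
    """Map each term to its most similar other terms.

    weights: dict of term -> list of significance weights, one entry per cluster
             (the rows of the matrix formed in claim 6).
    Assumption: terms with zero similarity are never champions.
    """
    result = {}
    for term, row in weights.items():
        scored = [
            (cosine(row, other_row), other)
            for other, other_row in weights.items()
            if other != term
        ]
        # Rank by descending similarity; break ties alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[term] = [other for sim, other in scored[:top_n] if sim > 0.0]
    return result


# Made-up weights over three clusters, for illustration only.
weights = {
    "bank": [5.0, 0.0, 0.0],
    "banks": [4.0, 0.0, 0.0],
    "banking": [3.0, 0.0, 1.0],
    "tennis": [0.0, 6.0, 0.0],
}
champion_lists = champions(weights)
```

Terms that are significant in the same clusters end up as each other's champions, which is what makes the champion lists a plausible source of candidate inflections for the lexical rules in claims 7 and 8.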

9. A method of automatically classifying a document which comprises a plurality of words and phrases into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of documents related to a particular corresponding subject area, the method comprising:
- calculating, for each of the clusters, values of a statistical weight of significance of distributions of the words and phrases in the cluster and in a complement of the cluster, and assigning a value of zero to the weights of words and phrases which are not statistically significant in the cluster;
- calculating using the values of the weights of significance of the words and phrases in each cluster statistical signatures of the cluster, said statistical signatures comprising sums of weighted products and intersections of words and phrases in the cluster;
- determining from the statistical signatures cluster scores for the document representing the relevance of the document to each cluster; and
- classifying the document into one or more clusters based upon said scores.
  - 10. The method of claim 9, wherein said calculating values of statistical weight of significance comprises calculating values of a standard normal variable of a standard normal probability distribution for each word and phrase, and said assigning comprises assigning a value of zero to the weights of words and phrases which are not statistically significant in a cluster, and wherein said sum of weighted products comprises the products of the numbers of occurrences of words and phrases in the document and the corresponding weights of significance of the words and phrases, and the intersection sum comprises the number of occurrences in the document of words and phrases having weights of significance greater than a predetermined threshold.
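
The two signatures named in claim 10 (the sum of weighted products and the intersection sum) can be computed directly from term counts, as in the sketch below. How the signatures are combined into a final score is not spelled out in the claim, so the `classify` function makes a simple assumption: a cluster is accepted when its intersection sum reaches `min_intersection`, and accepted clusters are ranked by their weighted sum. All function names, parameters, and sample weights are hypothetical.

```python
from collections import Counter


def cluster_signatures(doc_terms, cluster_weights, threshold=0.0):
    """Return (sum of weighted products, intersection sum) for one cluster.

    doc_terms:       list of words or phrases in the document
    cluster_weights: dict of term -> weight of significance in the cluster
                     (terms that are not significant are absent or 0.0)
    """
    counts = Counter(doc_terms)
    # Occurrences in the document times the term's weight in the cluster.
    weighted_sum = sum(n * cluster_weights.get(t, 0.0) for t, n in counts.items())
    # Occurrences of terms whose weight exceeds the threshold.
    intersection = sum(
        n for t, n in counts.items() if cluster_weights.get(t, 0.0) > threshold
    )
    return weighted_sum, intersection


def classify(doc_terms, clusters, min_intersection=2):
    """Return accepted cluster names, best first (assumed decision rule)."""
    scored = []
    for name, cluster_weights in clusters.items():
        weighted_sum, intersection = cluster_signatures(doc_terms, cluster_weights)
        if intersection >= min_intersection:
            scored.append((weighted_sum, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Made-up cluster weights and document, for illustration only.
clusters = {
    "finance": {"bank": 5.0, "loan": 4.0},
    "sports": {"tennis": 6.0, "match": 3.0},
}
doc = ["bank", "loan", "bank", "match"]
labels = classify(doc, clusters)
```

With `threshold=0.0` the intersection sum counts only terms of positive significance, which corresponds to the choice made in claim 4.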

11. A system for automatically classifying a text entity which comprises a plurality of terms into one or more clusters of a plurality of clusters which characterize a corpus of text in corresponding subject areas, each cluster having a plurality of text entities related to a particular corresponding subject area, the system comprising a classifier having:
- means for determining for a selected term in a text entity to be classified and for each cluster a probability distribution of the selected term in the cluster and in a complement of the cluster;
- means for assigning a weight of zero to terms which are not statistically significant in the cluster;
- means for calculating a statistical score for the cluster from the non-zero weights of significance of terms in the cluster; and
- means for classifying the text entity into one or more clusters based upon said score.

Current Assignee: APR Smartlogik Limited
Original Assignee: SLG Realisations Plc (Progress Software Corporation)
Inventors: Zhilyaev, Maxim
Primary Examiner(s): Boudreau, Leo H.
Assistant Examiner(s): Mariam, Daniel G.
Application Number: US08/876,271
Time in Patent Office: 1,226 Days
Field of Search: 382/224, 382/225, 382/226, 382/228, 382/229, 382/230, 345/440
US Class Current: 382/225
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 18/00 Pattern recognition
