Document classification method and apparatus therefor
Abstract
A document classification system and method classifies words into word clusters. Word clusters are arranged for categories of documents, and each word falls into a word cluster with a given probability. A linear combination model, built from the distribution of word clusters in a category and the distribution of words in each word cluster, indicates probabilistically whether a document belongs to a particular category.
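The scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the patent's implementation: the cluster names, words, and probability values are hypothetical, and the per-document score assumes that words are independent given the category, which the abstract does not state.

```python
import math

# Hypothetical toy distributions (illustrative values, not from the patent).
# P(k | c): probability of each word cluster within a category.
cluster_given_category = {
    "sports": {"k_ball": 0.8, "k_money": 0.2},
    "finance": {"k_ball": 0.1, "k_money": 0.9},
}
# P(w | k): probability of each classification word within a cluster.
word_given_cluster = {
    "k_ball": {"goal": 0.5, "match": 0.5},
    "k_money": {"stock": 0.6, "bank": 0.4},
}


def word_prob(word, category):
    """Linear combination: sum over clusters of P(w | k) * P(k | c)."""
    return sum(
        word_given_cluster[k].get(word, 0.0) * p_k
        for k, p_k in cluster_given_category[category].items()
    )


def log_score(words, category):
    """Sum of log word probabilities (assumes independence given c)."""
    total = 0.0
    for w in words:
        p = word_prob(w, category)
        if p == 0.0:
            return float("-inf")  # word impossible under this category
        total += math.log(p)
    return total


def classify(words):
    """Return the category whose model scores the document highest."""
    return max(cluster_given_category, key=lambda c: log_score(words, c))
```

Calling `classify(["goal", "match"])` on these toy values selects `"sports"`, because both words draw most of their probability from the cluster that dominates that category.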
2 Claims
1. A document classification system comprising:
a category memory section storing document categories;
a word cluster distribution memory section storing word cluster distributions for each of the document categories;
a word distribution memory section storing classification words and classification word distributions for each of the word clusters;
a learning section connected to each said memory section that prepares the word cluster distributions in each of the document categories and provides the word cluster distributions to said word cluster distribution memory section, and that prepares the word distributions in each of the word clusters and provides the word distributions to said word distribution memory section; and
a document classification section that classifies a document based on linear combination models, there being one of the linear combination models for each of the document categories, each of the linear combination models linearly combining a respective one of the word distributions times a respective one of the word cluster distributions for each of the classification words in the document and has the form;
##EQU9##
where P(W|c) is a probability that the document W is in the document category c, P(W|ki) is a probability of appearance of the classification word w in the word cluster ki, P(ki|c) is a probability of appearance of the word cluster ki in the document category c, and n is a number of the classification words in the document W.
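The equation behind the ##EQU9## placeholder did not survive extraction. One reading consistent with the symbol definitions in claim 1 (a sum of word-distribution terms times word-cluster-distribution terms, with n as the upper limit) is given below. It is an assumption offered for orientation, not the authoritative claim text, and the same form presumably stands behind ##EQU10## in claim 2.

```latex
P(W \mid c) \;=\; \sum_{i=1}^{n} P(k_i \mid c)\, P(W \mid k_i)
```

Under this reading, each term weights the probability of the word(s) under cluster k_i by how prominent that cluster is in category c.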
2. A method of classifying a document into one of plural document categories comprising the steps of:
identifying classification words in each word cluster in each of the plural document categories;
determining word cluster distributions for each of the document categories from a learning set of documents;
determining classification word distributions for each of the word clusters from the learning set of documents;
compiling a linear combination model for each of the document categories, each of the linear combination models linearly combining a respective one of the word distributions times a respective one of the word cluster distributions for each of the classification words in a document being classified and has the form;
##EQU10##
where P(W|c) is a probability that the document W is in the document category c, P(W|ki) is a probability of appearance of the classification word w in the word cluster ki, P(ki|c) is a probability of appearance of the word cluster ki in the document category c, and n is a number of the classification words in the document W; and
comparing the compiled linear combination models to determine the one of the document categories for the document being classified.
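The steps of claim 2 can be sketched end to end as follows. This is a rough illustration under stated assumptions: the word-to-cluster table `WORD_TO_CLUSTER` is hypothetical and assigns each word to exactly one cluster, both distributions are estimated by simple relative frequency, and a small probability floor guards against taking the log of zero. The claim itself does not prescribe any of these choices.

```python
from collections import Counter, defaultdict
import math

# Hypothetical hard assignment of classification words to clusters
# (an assumption for this sketch, not taken from the patent).
WORD_TO_CLUSTER = {"goal": "k1", "match": "k1", "stock": "k2", "bank": "k2"}


def learn(training_set):
    """Estimate P(k | c) and P(w | k) by relative frequency.

    training_set: list of (category, list_of_words) pairs.
    """
    cluster_counts = defaultdict(Counter)  # category -> cluster -> count
    word_counts = defaultdict(Counter)     # cluster -> word -> count
    for category, words in training_set:
        for w in words:
            k = WORD_TO_CLUSTER.get(w)
            if k is None:
                continue  # not a classification word
            cluster_counts[category][k] += 1
            word_counts[k][w] += 1
    p_k_given_c = {
        c: {k: n / sum(cnt.values()) for k, n in cnt.items()}
        for c, cnt in cluster_counts.items()
    }
    p_w_given_k = {
        k: {w: n / sum(cnt.values()) for w, n in cnt.items()}
        for k, cnt in word_counts.items()
    }
    return p_k_given_c, p_w_given_k


def classify(words, p_k_given_c, p_w_given_k, floor=1e-9):
    """Compare the per-category models and return the best category."""
    def log_score(c):
        total = 0.0
        for w in words:
            if w not in WORD_TO_CLUSTER:
                continue  # skip words that are not classification words
            p = sum(p_w_given_k[k].get(w, 0.0) * pk
                    for k, pk in p_k_given_c[c].items())
            total += math.log(max(p, floor))  # floor avoids log(0)
        return total
    return max(p_k_given_c, key=log_score)


# Tiny hypothetical learning set.
training = [
    ("sports", ["goal", "match", "goal", "bank"]),
    ("finance", ["stock", "bank", "stock", "match"]),
]
p_kc, p_wk = learn(training)
```

With this learning set, `classify(["goal", "match"], p_kc, p_wk)` selects `"sports"` and `classify(["stock", "bank"], p_kc, p_wk)` selects `"finance"`.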
Specification