Differential LSI space-based probabilistic document classifier
First Claim
1. A method of setting up a DLSI space-based classifier to be stored in a computer storage device for document classification using a computer, and using said classifier by said computer to classify a document according to a plurality of clusters within a database, comprising the steps of:
preprocessing documents using said computer to distinguish terms of a word and a noun phrase from stop words;
constructing system terms by setting up a term list as well as global weights using said computer;
normalizing document vectors of collected documents, as well as centroid vectors of each cluster using said computer;
constructing a differential term by intra-document matrix $D_I^{m \times n_I}$ using said computer, such that each column in said matrix is a differential intra-document vector;
decomposing the differential term by intra-document matrix $D_I$, by an SVD algorithm using said computer, into $D_I = U_I S_I V_I^T$ ($S_I = \mathrm{diag}(\delta_{I,1}, \delta_{I,2}, \ldots)$), followed by a composition of $D_{I,k_I} = U_{k_I} S_{k_I} V_{k_I}^T$ giving an approximate $D_I$ in terms of an appropriate $k_I$;
setting up a likelihood function of intra-differential document vector using said computer;
constructing a term by extra-document matrix $D_E^{m \times n_E}$ using said computer, such that each column of said extra-document matrix is an extra-differential document vector;
decomposing $D_E$, by exploiting the SVD algorithm using said computer, into $D_E = U_E S_E V_E^T$ ($S_E = \mathrm{diag}(\delta_{E,1}, \delta_{E,2}, \ldots)$), then with a proper $k_E$, defining $D_{E,k_E} = U_{k_E} S_{k_E} V_{k_E}^T$ to approximate $D_E$;
setting up a likelihood function of extra-differential document vector using said computer;
setting up a posteriori function using said computer; and
said computer using said DLSI space-based classifier, as set up in the foregoing steps, to classify the document as belonging to one of the plurality of clusters within the database.
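The matrix-construction and SVD steps of claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names (`unit_columns`, `differential_matrices`, `truncated_svd`) are hypothetical, the input is assumed to be an already weighted term-by-document matrix, and pairing each document with every other cluster's centroid to form the extra-document vectors is an assumption, since the claim only requires a centroid of a cluster that does not include the document. The likelihood and posteriori functions are not covered here.

```python
import numpy as np

def unit_columns(a):
    """Scale each column to unit Euclidean length (zero columns stay zero)."""
    norms = np.linalg.norm(a, axis=0, keepdims=True)
    return a / np.where(norms == 0, 1.0, norms)

def differential_matrices(docs, labels):
    """Build D_I and D_E from an m x n term-by-document matrix.

    docs   : m x n array, one column per document (assumed already weighted)
    labels : length-n sequence giving the cluster of each document
    """
    docs = unit_columns(np.asarray(docs, dtype=float))
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Normalized centroid of each cluster, one column per cluster.
    centroids = unit_columns(np.column_stack(
        [docs[:, labels == c].mean(axis=1) for c in clusters]))
    intra, extra = [], []
    for j, lab in enumerate(labels):
        for i, c in enumerate(clusters):
            diff = docs[:, j] - centroids[:, i]
            # Own cluster -> intra-document vector; any other -> extra (assumption).
            (intra if c == lab else extra).append(diff)
    return np.column_stack(intra), np.column_stack(extra), centroids

def truncated_svd(d, k):
    """Return U_k, the k largest singular values, and V_k^T of matrix d."""
    u, s, vt = np.linalg.svd(d, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]
```

With these helpers, `u @ np.diag(s) @ vt` for the factors returned by `truncated_svd(d_i, k)` gives the rank-k approximation that the claim writes as $D_{I,k_I}$; how to choose k is left open here, as it is in the claim ("an appropriate $k_I$").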
Abstract
A computerized method for automatic document classification based on a combined use of the projection and the distance of the differential document vectors to the differential latent semantics index (DLSI) spaces. The method includes the setting up of a DLSI space-based classifier to be stored in computer storage and the use of such classifier by a computer to evaluate the possibility of a document belonging to a given cluster using a posteriori probability function and to classify the document in the cluster. The classifier is effective in operating on very large numbers of documents such as with document retrieval systems over a distributed computer network.
66 Citations
9 Claims
-
-
2. An automatic document classification method using a DLSI space-based classifier operating on a computer as a computerized classifier to classify a document in accordance with clusters in a database, comprising the steps of:
-
a) setting up, by said computerized classifier, a document vector by generating terms as well as frequencies of occurrence of said terms in the document, so that a normalized document vector $N$ is obtained for the document;
b) constructing, by said computerized classifier, using the document to be classified, a differential document vector $x = N - C$, where $C$ is the normalized vector giving a center or centroid of a cluster;
c) calculating an intra-document likelihood function $P(x \mid D_I)$ for the document using said computerized classifier;
d) calculating an extra-document likelihood function $P(x \mid D_E)$ for the document using said computerized classifier;
e) calculating a Bayesian posteriori probability function using said computerized classifier $P(D_I \mid x)$;
f) repeating, for each of the clusters of the data base, steps b–e;
g) selecting, by said computerized classifier, a cluster having a largest $P(D_I \mid x)$ as the cluster to which the document most likely belongs; and
h) classifying, by said computerized classifier, the document in the selected cluster within said database thereby categorizing said document for an automated document retrieval system.
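Steps b) through g) of claim 2 amount to a loop over cluster centroids followed by an argmax. The sketch below is illustrative only: the names `posterior_intra` and `classify` are hypothetical, the two likelihood functions are passed in as callables because the claim does not spell out their form, and the prior `prior_intra` (with its default of 0.5) is an assumed parameter. The posterior itself is ordinary Bayes' rule over the two hypotheses $D_I$ and $D_E$.

```python
import numpy as np

def posterior_intra(p_x_intra, p_x_extra, prior_intra):
    """Bayes' rule: P(D_I | x) from P(x | D_I), P(x | D_E) and a prior P(D_I)."""
    num = p_x_intra * prior_intra
    den = num + p_x_extra * (1.0 - prior_intra)
    return num / den if den > 0 else 0.0

def classify(doc, centroids, lik_intra, lik_extra, prior_intra=0.5):
    """Return (index of best cluster, list of P(D_I | x) per cluster).

    doc        : length-m term vector of the document to classify
    centroids  : m x c array of normalized centroid vectors, one per column
    lik_intra  : callable x -> P(x | D_I)  (supplied by the caller)
    lik_extra  : callable x -> P(x | D_E)  (supplied by the caller)
    """
    doc = np.asarray(doc, dtype=float)
    n = doc / np.linalg.norm(doc)          # step a: normalized document vector N
    scores = []
    for c in centroids.T:                  # step f: repeat for every cluster
        x = n - c                          # step b: differential vector x = N - C
        scores.append(posterior_intra(lik_intra(x), lik_extra(x), prior_intra))
    return int(np.argmax(scores)), scores  # step g: largest P(D_I | x)
```

Passing the likelihoods as callables keeps the loop independent of how the intra- and extra-DLSI spaces were fitted.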
-
-
4. A method of setting up a DLSI space-based classifier to be stored in a computer storage device for document classification using a computer, and using said classifier by said computer to classify a document according to a plurality of clusters within a database, comprising the steps of:
-
setting up, using said computer, a differential term by intra-document matrix where each column of the matrix denotes a difference between a document and a centroid of a cluster to which the document belongs;
decomposing, using said computer, the differential term by intra-document matrix by an SVD algorithm to identify an intra-DLSI space;
setting up, using said computer, a probability function for a differential document vector being a differential intra-document vector;
calculating, using said computer, the probability function according to projection and distance from the differential document vector to the intra-DLSI space;
setting up, using said computer, a differential term by extra-document matrix where each column of the matrix denotes a differential document vector between a document vector and a centroid vector of a cluster which does not include the document;
decomposing, using said computer, the differential term by extra-document matrix by an SVD algorithm to identify an extra-DLSI space;
setting up, using said computer, a probability function for a differential document vector being a differential extra-document vector;
setting up, using said computer, a posteriori likelihood function using the differential intra-document and differential extra-document vectors to provide a most probable similarity measure of a document belonging to a given cluster; and
said computer using said DLSI space-based classifier, as set up in the foregoing steps, to classify the document as belonging to one of the plurality of clusters within the database.
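The "projection and distance" in claim 4 can be made concrete with a short sketch. Given an orthonormal basis `u_k` of a DLSI space (the leading left singular vectors), the projection is $y = U_k^T x$ and the distance to the space is $\varepsilon = \sqrt{\lVert x \rVert^2 - \lVert y \rVert^2}$. The function names are hypothetical, and `gaussian_log_score` is only an assumed Gaussian-style way of combining the two quantities; the claim does not state the probability function's form, and the residual-variance parameter `rho` is introduced purely for illustration.

```python
import numpy as np

def projection_and_distance(x, u_k):
    """Project x onto the space spanned by the orthonormal columns of u_k.

    Returns (y, eps): y holds the coordinates of the projection and eps is
    the Euclidean distance from x to the space.
    """
    x = np.asarray(x, dtype=float)
    y = u_k.T @ x
    # Clamp at zero to guard against tiny negative values from rounding.
    eps_sq = max(float(x @ x - y @ y), 0.0)
    return y, np.sqrt(eps_sq)

def gaussian_log_score(x, u_k, s_k, rho):
    """Illustrative log-score (an assumption, not the claimed function).

    Projection coordinates are scaled by the singular values s_k, and the
    squared distance to the space is scaled by rho.
    """
    y, eps = projection_and_distance(x, u_k)
    return -0.5 * (np.sum((y / s_k) ** 2) + eps ** 2 / rho)
```

A score of this kind could be evaluated once against the intra-DLSI space and once against the extra-DLSI space before the two are combined in a posteriori function.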
-
Specification