New differential LSI space-based probabilistic document classifier
Abstract
A method for automatic document classification based on a combined use of the projection and the distance of the differential document vectors to the differential latent semantics index (DLSI) spaces. The method includes the setting up of a DLSI space-based classifier and the use of such classifier to evaluate the possibility of a document belonging to a given cluster using a posteriori probability function.
9 Claims
1. A method of setting up a DLSI space-based classifier for document classification comprising the steps of:
preprocessing documents to distinguish terms of a word and a noun phrase from stop words;
constructing system terms by setting up a term list as well as global weights;
normalizing document vectors of collected documents, as well as centroid vectors of each cluster;
constructing a differential term by intra-document matrix $D_I^{m \times n_I}$, such that each column in said matrix is a differential intra-document vector;
decomposing the differential term by intra-document matrix $D_I$, by an SVD algorithm, into $D_I = U_I S_I V_I^T$ ($S_I = \mathrm{diag}(\delta_{I,1}, \delta_{I,2}, \ldots)$), followed by a composition of $D_{I,k_I} = U_{k_I} S_{k_I} V_{k_I}^T$ giving an approximate $D_I$ in terms of an appropriate $k_I$;
setting up a likelihood function of intra-differential document vector;
constructing a term by extra-document matrix $D_E^{m \times n_E}$, such that each column of said extra-document matrix is an extra-differential document vector;
decomposing $D_E$, by exploiting the SVD algorithm, into $D_E = U_E S_E V_E^T$ ($S_E = \mathrm{diag}(\delta_{E,1}, \delta_{E,2}, \ldots)$), then with a proper $k_E$, defining $D_{E,k_E} = U_{k_E} S_{k_E} V_{k_E}^T$ to approximate $D_E$;
setting up a likelihood function of extra-differential document vector;
setting up a posteriori function; and
using the DLSI space-based classifier to automatically classify a document.
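The matrix-construction and decomposition steps of claim 1 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`normalize_columns`, `differential_matrices`, `truncated_svd`), the toy term-by-document matrix, the cluster labels, and the choice k = 2 are all hypothetical, and preprocessing, global weights, and the likelihood functions are left out.

```python
import numpy as np


def normalize_columns(a):
    """Scale each column to unit Euclidean length (all-zero columns stay zero)."""
    a = np.asarray(a, dtype=float)
    norms = np.linalg.norm(a, axis=0)
    norms[norms == 0] = 1.0
    return a / norms


def differential_matrices(docs, labels):
    """Build the intra- and extra-document differential matrices.

    docs   : m x n term-by-document matrix (one column per document)
    labels : cluster label of each document (assumes at least two clusters)
    """
    docs = normalize_columns(docs)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Normalized centroid vector of each cluster.
    centroids = {}
    for c in clusters:
        v = docs[:, labels == c].mean(axis=1)
        centroids[c] = v / (np.linalg.norm(v) or 1.0)
    n = docs.shape[1]
    # Intra: document minus the centroid of its own cluster.
    intra = [docs[:, j] - centroids[labels[j]] for j in range(n)]
    # Extra: document minus the centroid of each cluster it does not belong to.
    extra = [docs[:, j] - centroids[c]
             for j in range(n) for c in clusters if c != labels[j]]
    return np.column_stack(intra), np.column_stack(extra)


def truncated_svd(d, k):
    """Return U_k, the k leading singular values, and V_k^T of matrix d."""
    u, s, vt = np.linalg.svd(d, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


# Toy data: 4 terms, 4 documents, 2 clusters (illustrative values only).
docs = np.array([[2, 1, 0, 0],
                 [1, 2, 0, 0],
                 [0, 0, 1, 2],
                 [0, 0, 2, 1]], dtype=float)
labels = [0, 0, 1, 1]

d_intra, d_extra = differential_matrices(docs, labels)
u_i, s_i, vt_i = truncated_svd(d_intra, k=2)
d_intra_k = u_i @ np.diag(s_i) @ vt_i  # rank-k approximation of D_I
```

The same `truncated_svd` call applied to `d_extra` would give the extra-document counterpart; how k is chosen for each matrix is not specified here and is left to the caller.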
2. An automatic document classification method using a DLSI space-based classifier for classifying a document in accordance with clusters in a database, comprising the steps of:
a) setting up a document vector by generating terms as well as frequencies of occurrence of said terms in the document, so that a normalized document vector N is obtained for the document;
b) constructing, using the document to be classified, a differential document vector $x = N - C$, where C is the normalized vector giving a center or centroid of a cluster;
c) calculating an intra-document likelihood function $P(x \mid D_I)$ for the document;
d) calculating an extra-document likelihood function $P(x \mid D_E)$ for the document;
e) calculating a Bayesian posteriori probability function $P(D_I \mid x)$;
f) repeating, for each of the clusters of the data base, steps b-e;
g) selecting a cluster having a largest $P(D_I \mid x)$ as the cluster to which the document most likely belongs; and
h) classifying the document in the selected cluster.
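Steps a) through h) can be sketched as a short loop in plain Python. The sketch assumes Bayes' rule in the usual two-hypothesis form, P(D_I|x) = P(x|D_I)P(D_I) / (P(x|D_I)P(D_I) + P(x|D_E)P(D_E)) with P(D_E) = 1 - P(D_I). The helper names, the prior of 0.5, the term list, the centroids, and especially the isotropic Gaussian stand-ins for the two likelihood functions are assumptions made for illustration; they are not the likelihood functions of the claimed classifier.

```python
import math
from collections import Counter


def normalized_vector(tokens, term_list):
    """Step a: term frequencies over a fixed term list, scaled to unit length."""
    counts = Counter(tokens)
    vec = [float(counts[t]) for t in term_list]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def posterior(p_x_intra, p_x_extra, prior_intra):
    """Step e: Bayes' rule with P(D_E) = 1 - P(D_I)."""
    num = p_x_intra * prior_intra
    den = num + p_x_extra * (1.0 - prior_intra)
    return num / den if den else 0.0


def classify(n_vec, centroids, lik_intra, lik_extra, prior_intra):
    """Steps b-g: score every cluster and keep the one with the largest posterior."""
    best, best_p = None, -1.0
    for name, c in centroids.items():
        x = [a - b for a, b in zip(n_vec, c)]  # step b: x = N - C
        p = posterior(lik_intra(x), lik_extra(x), prior_intra)
        if p > best_p:
            best, best_p = name, p
    return best, best_p


def gaussian_stand_in(sigma):
    """Placeholder likelihood: isotropic Gaussian density in the length of x."""
    def density(x):
        r2 = sum(v * v for v in x)
        return math.exp(-r2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2) ** (len(x) / 2)
    return density


# Hypothetical term list and normalized cluster centroids.
term_list = ["ball", "goal", "vote", "law"]
h = 1 / math.sqrt(2)
centroids = {"sports": [h, h, 0.0, 0.0], "politics": [0.0, 0.0, h, h]}

n_vec = normalized_vector(["ball", "ball", "goal", "vote"], term_list)
cluster, p = classify(n_vec, centroids,
                      gaussian_stand_in(0.3),  # assumed intra-document spread
                      gaussian_stand_in(1.0),  # assumed extra-document spread
                      prior_intra=0.5)
```

Passing the likelihoods in as callables keeps the selection loop independent of how P(x|D_I) and P(x|D_E) are actually computed, so a DLSI-based pair could be substituted without changing `classify`.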
4. A method of setting up a DLSI space-based classifier for document classification, comprising the steps of:
setting up a differential term by intra-document matrix where each column of the matrix denotes a difference between a document and a centroid of a cluster to which the document belongs;
decomposing the differential term by intra-document matrix by an SVD algorithm to identify an intra-DLSI space;
setting up a probability function for a differential document vector being a differential intra-document vector;
calculating the probability function according to projection and distance from the differential document vector to the intra-DLSI space;
setting up a differential term by extra-document matrix where each column of the matrix denotes a differential document vector between a document vector and a centroid vector of a cluster which does not include the document;
decomposing the differential term by extra-document matrix by an SVD algorithm to identify an extra-DLSI space;
setting up a probability function for a differential document vector being a differential extra-document vector;
setting up a posteriori likelihood function using the differential intra-document and differential extra-document vectors to provide a most probable similarity measure of a document belonging to a cluster; and
using the DLSI space-based classifier to automatically classify a document.
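The "projection and distance" step of claim 4 can be illustrated with NumPy. In this sketch the DLSI space is taken to be the span of the k leading left singular vectors, the projection is y = U_k^T x, and the squared distance to the space is the residual ||x||^2 - ||y||^2. The scoring function that combines them is an assumed Gaussian-style log score (projection coordinates scaled by the retained singular values, residual scaled by the mean of the discarded squared singular values); its exact form, the names `dlsi_space`, `projection_and_distance`, and `log_score`, and the toy matrix are assumptions rather than text from the claims.

```python
import numpy as np


def dlsi_space(d, k):
    """Return U_k, the k leading singular values (assumed nonzero), and rho.

    rho is the mean of the discarded squared singular values, floored at a
    tiny positive value so it can be used as a divisor.
    """
    u, s, _ = np.linalg.svd(np.asarray(d, dtype=float), full_matrices=False)
    rest = s[k:] ** 2
    rho = float(rest.mean()) if rest.size and rest.mean() > 0 else 1e-12
    return u[:, :k], s[:k], rho


def projection_and_distance(x, u_k):
    """Projection of x onto the space and squared distance from x to it."""
    x = np.asarray(x, dtype=float)
    y = u_k.T @ x
    eps2 = max(float(x @ x - y @ y), 0.0)  # clamp tiny negative rounding error
    return y, eps2


def log_score(x, u_k, s_k, rho, n):
    """Assumed unnormalized Gaussian-style log score; higher means a closer fit."""
    y, eps2 = projection_and_distance(x, u_k)
    return -0.5 * n * float(np.sum((y / s_k) ** 2)) - 0.5 * n * eps2 / rho


# Toy differential matrix with singular values 1, 0.5, and 0.1 (illustrative).
d = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 0.1],
              [0.0, 0.0, 0.0]])
u_k, s_k, rho = dlsi_space(d, k=2)

inside = [0.3, 0.4, 0.0, 0.0]   # lies in the retained space
outside = [0.0, 0.0, 0.5, 0.0]  # lies entirely in a discarded direction
score_in = log_score(inside, u_k, s_k, rho, n=d.shape[1])
score_out = log_score(outside, u_k, s_k, rho, n=d.shape[1])
```

A vector inside the retained space has near-zero residual and is penalized only through its projection, while a vector in a discarded direction is penalized through the distance term, which is why both quantities appear in the score.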
Specification