Method and system for classifying semi-structured documents
Abstract
A classifier for semi-structured documents and associated method dynamically and accurately classify documents with an implicit or explicit schema by taking advantage of the term-frequency and term-distribution information inherent in the document. The system uses a structured vector model that allows like terms to be grouped together and dissimilar terms to be segregated based on their frequency and distribution within the sub-vectors of the structured vector, thus achieving context sensitivity. The final decision for assigning the class of a document is based on a mathematical comparison of the similarity of the terms in the structured vector to those of the various class models. The classifier of the present invention is capable of both learning and testing. In the learning phase, the classifier develops models for classes from the composite information gleaned from numerous training documents. Specifically, it develops a structured vector model for each training document. Then, within a given class of documents, it adds and then normalizes the occurrences of terms.
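The learning phase described in the abstract can be illustrated with a minimal sketch. It assumes the semi-structured documents are XML, that terms are lowercase whitespace-separated tokens, and that each sub-vector is keyed by the path from the root to a node; the function names `vectorize` and `learn_class_model` are hypothetical and do not come from the patent.

```python
from collections import Counter, defaultdict
import xml.etree.ElementTree as ET


def vectorize(xml_text):
    """Map each root-to-node path to a Counter of the terms found at that node."""
    root = ET.fromstring(xml_text)
    vector = defaultdict(Counter)

    def walk(node, path):
        here = path + "/" + node.tag
        # Terms directly inside this element (text before its first child).
        vector[here].update((node.text or "").lower().split())
        for child in node:
            walk(child, here)
            # Text that follows a child still belongs to the current element.
            vector[here].update((child.tail or "").lower().split())

    walk(root, "")
    return dict(vector)


def learn_class_model(training_docs):
    """Add per-path term counts over one class's documents, then normalize per path."""
    totals = defaultdict(Counter)
    for doc in training_docs:
        for path, counts in vectorize(doc).items():
            totals[path].update(counts)
    model = {}
    for path, counts in totals.items():
        n = sum(counts.values())
        if n:  # skip paths that carry no terms
            model[path] = {term: c / n for term, c in counts.items()}
    return model
```

Because counts are kept per path, the same term appearing under two different elements lands in two different sub-vectors, which is one simple way to obtain the context sensitivity the abstract refers to.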
26 Claims
1. A classifier, for use on a computer readable medium, for dynamically classifying a semi-structured document with a schema, comprising:
a vectorization module for parsing the document into a structured vector model, wherein the structured vector model is divided into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
a sorting module for searching the document and for counting the occurrences of individual terms in the document;
the sorting module further accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
a testing module for assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,Fk] according to the following expression;
where d is the document, p(c) is a prior distribution on the class c;
c′ is a class in a set of documents;
pd is a path to a structure node ed from a root;
n is a number of occurrences of term t in pd, f is a maximum likelihood estimation;
Fk is a set of selected terms;
F is a Fisher index defined by the following equation, where c1 and c2 are children of an internal class c0, and m is an average number of an occurrence of term t in class c;
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
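The expression and the equation that the claim refers to are not reproduced in this text. As a reading aid only, the following is a sketch of forms that would be consistent with the symbol definitions in the claim: a normalized naive-Bayes posterior and the usual Fisher discriminant ratio. These are assumptions, not the patent's own formulas; in particular, the argument lists of n, f and m, and the per-document count n(d, t) in the Fisher denominator, are introduced here for illustration.

```latex
% Assumed form of the posterior: prior times per-path term likelihoods,
% normalized over all classes c'.
\[
\Pr[c \mid d, F_k] =
\frac{p(c) \prod_{e_d \in d} \prod_{t \in e_d \cap F_k} f(c, p_d, t)^{\,n(d, p_d, t)}}
     {\sum_{c'} p(c') \prod_{e_d \in d} \prod_{t \in e_d \cap F_k} f(c', p_d, t)^{\,n(d, p_d, t)}}
\]

% Assumed form of the Fisher index of a term t: between-class scatter of the
% class means m(c, t) over the within-class variance, for children c of c_0.
\[
F(t) =
\frac{\sum_{c_1, c_2} \bigl( m(c_1, t) - m(c_2, t) \bigr)^2}
     {\sum_{c} \frac{1}{|c|} \sum_{d \in c} \bigl( n(d, t) - m(c, t) \bigr)^2}
\]
```

Under this reading, F_k would be the k terms with the highest Fisher index, and the assigned class is the c that maximizes the first expression.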
12. A software program product for dynamically classifying a semi-structured document with a schema, comprising:
a vectorization module for parsing the document into a structured vector model, wherein the structured vector model is divided into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
a sorting module for searching the document and for counting the occurrences of individual terms in the document;
the sorting module further accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
a testing module for assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,Fk] according to the following expression;
where d is the document, p(c) is a prior distribution on the class c;
c′ is a class in a set of documents;
pd is a path to a structure node ed from a root;
n is a number of occurrences of term t in pd, f is a maximum likelihood estimation;
Fk is a set of selected terms;
F is a Fisher index defined by the following equation, where c1 and c2 are children of an internal class c0, and m is an average number of an occurrence of term t in class c;
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
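The testing module's decision rule can be sketched as follows, assuming a naive-Bayes-style posterior in which each selected term contributes its per-path class probability raised to its count. The data layout, the function names, and the `floor` probability used for unseen (path, term) pairs are all assumptions made for illustration; the claim does not specify a smoothing scheme.

```python
import math


def log_posteriors(doc_vector, class_models, priors, selected_terms, floor=1e-6):
    """Return normalized log Pr[c | d, F_k] for every class c.

    doc_vector:   {path: {term: count}} for the document d
    class_models: {class: {path: {term: probability}}}
    priors:       {class: prior probability p(c)}
    """
    scores = {}
    for c, model in class_models.items():
        score = math.log(priors[c])
        for path, counts in doc_vector.items():
            probs = model.get(path, {})
            for term, n in counts.items():
                if term in selected_terms:
                    # floor is an assumed stand-in for smoothing unseen pairs.
                    score += n * math.log(probs.get(term, floor))
        scores[c] = score
    # Denominator: log-sum-exp over all classes c', computed stably.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: s - log_z for c, s in scores.items()}


def classify(doc_vector, class_models, priors, selected_terms):
    """Pick the class with the maximum a posteriori probability."""
    post = log_posteriors(doc_vector, class_models, priors, selected_terms)
    return max(post, key=post.get)
```

Working in log space avoids underflow when many term probabilities are multiplied, and the normalization step does not change which class wins; it only turns the scores into proper posteriors.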
21. A method for dynamically classifying a semi-structured document, comprising:
parsing the document into a structured vector model;
dividing the structured vector model into a tree of sub-vectors to reflect a plurality of hierarchical levels beginning with a root and ending with a plurality of leaves;
searching the document and counting the occurrences of individual terms in the document;
accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
assigning a class to the document by using a statistical model based on probability calculation to create a classification model; and
wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,Fk] according to the following expression;
where d is the document, p(c) is a prior distribution on the class c;
c′ is a class in a set of documents;
pd is a path to a structure node ed from a root;
n is a number of occurrences of term t in pd, f is a maximum likelihood estimation;
Fk is a set of selected terms;
F is a Fisher index defined by the following equation, where c1 and c2 are children of an internal class c0, and m is an average number of an occurrence of term t in class c;
- View Dependent Claims (22, 23, 24)
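Selecting the term set Fk with a Fisher index can be sketched as below. This assumes the common Fisher discriminant form (squared differences of class means over within-class variance), flat per-document term counts rather than per-path counts, and non-empty sibling classes; `fisher_index` and `select_terms` are hypothetical names. Summing over unordered class pairs instead of ordered pairs only scales the score and leaves the ranking unchanged.

```python
from itertools import combinations


def fisher_index(term, classes):
    """classes: {class_name: [{term: count}, ...]}, one dict per document."""
    # m(c, t): average number of occurrences of the term in each class.
    means = {
        c: sum(d.get(term, 0) for d in docs) / len(docs)
        for c, docs in classes.items()
    }
    between = sum((means[a] - means[b]) ** 2 for a, b in combinations(classes, 2))
    within = sum(
        sum((d.get(term, 0) - means[c]) ** 2 for d in docs) / len(docs)
        for c, docs in classes.items()
    )
    if within:
        return between / within
    # No within-class spread: perfectly discriminating, or absent everywhere.
    return float("inf") if between else 0.0


def select_terms(classes, k):
    """Return the k terms with the highest Fisher index as the set F_k."""
    vocab = {t for docs in classes.values() for d in docs for t in d}
    ranked = sorted(vocab, key=lambda t: fisher_index(t, classes), reverse=True)
    return set(ranked[:k])
```

Terms whose average counts differ strongly between sibling classes while staying stable inside each class score highest, so they are the ones kept for classification.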
25. A method for dynamically classifying a semi-structured document, comprising:
parsing the document into a structured vector;
organizing the structured vector into a tree comprised of any of sub-vectors or structured vectors, to reflect a plurality of hierarchical levels in the document, beginning with a root and ending with a plurality of leaves;
searching the document and counting the occurrences of individual terms in the document;
accounting for the frequency of occurrence of the terms at each hierarchical level to achieve a high contextual sensitivity;
assigning a class to the document based on both term frequency and term distribution information and structure within the structured vector of the document, by using a statistical model based on probability calculation to create a classification model; and
wherein the classification model assigns a class to the document that maximizes a posteriori class probability Pr[c|d,Fk] according to the following expression;
where d is the document, p(c) is a prior distribution on the class c;
c′ is a class in a set of documents;
pd is a path to a structure node ed from a root;
n is a number of occurrences of term t in pd, f is a maximum likelihood estimation;
Fk is a set of selected terms;
F is a Fisher index defined by the following equation, where c1 and c2 are children of an internal class c0, and m is an average number of an occurrence of term t in class c;
- View Dependent Claims (26)
Specification