INFORMATION CLASSIFICATION PARADIGM
Abstract
A mechanism classifies source documents into one of two categories: likely to contain desired information or unlikely to contain desired information. Generally, some form of rule-based classification is used in conjunction with deeper analysis that applies advanced techniques to difficult cases. Rule-based classification is generally good at eliminating cases from further consideration and at identifying documents of interest based on readily discernible relationships between data, or on the presence or absence of data. The deeper analysis uncovers more complex relationships between data that may identify documents of interest. Some portions of the process may use the entire document, while other portions may use only a portion of the document.
32 Citations
20 Claims
1. A method comprising:
classifying an initial set of source documents into one of at least three groups based at least in part on at least one classification identifier, the at least three groups comprising a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined; and
for each document classified into the third group:
extracting from a document at least one feature vector; and
classifying the document into either the first group or the second group based on the at least one feature vector.
(Dependent claims: 2, 3, 4)
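The two-stage flow recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `Group`, `extract_features`, `second_pass`, `classify_all` and `toy_first_pass`, the choice of features, the use of "price" as a classification identifier, and the linear decision rule are all assumptions made for the example.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List, Sequence


class Group(Enum):
    OF_INTEREST = 1      # first group
    NOT_OF_INTEREST = 2  # second group
    UNDETERMINED = 3     # third group


def extract_features(doc: str) -> List[float]:
    """Hypothetical feature vector: token count, share of tokens containing a digit."""
    tokens = doc.split()
    n = len(tokens) or 1  # avoid division by zero on empty documents
    with_digit = sum(any(c.isdigit() for c in t) for t in tokens)
    return [float(len(tokens)), with_digit / n]


def second_pass(vector: Sequence[float], weights: Sequence[float], bias: float) -> Group:
    """Assumed linear decision rule; always resolves to the first or second group."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return Group.OF_INTEREST if score > 0 else Group.NOT_OF_INTEREST


def classify_all(
    docs: Iterable[str],
    first_pass: Callable[[str], Group],
    weights: Sequence[float],
    bias: float,
) -> Dict[str, Group]:
    """Run the first pass on every document, then re-classify only the undetermined ones."""
    result: Dict[str, Group] = {}
    for doc in docs:
        group = first_pass(doc)
        if group is Group.UNDETERMINED:
            group = second_pass(extract_features(doc), weights, bias)
        result[doc] = group
    return result


def toy_first_pass(doc: str) -> Group:
    """Illustrative rule only: 'price' stands in for a classification identifier."""
    if "price" in doc.lower():
        return Group.OF_INTEREST
    if not any(c.isdigit() for c in doc):
        return Group.NOT_OF_INTEREST
    return Group.UNDETERMINED
```

Only documents that the first pass leaves undetermined incur the cost of feature extraction, which matches the division of labour described in the abstract.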
5. A method comprising:
determining whether at least one currency identifier exists within a document and based on the presence or absence of the at least one currency identifier, classifying the document either as of interest or as requiring further examination;
if the document requires further examination then generating at least one feature vector representing a number of characteristics of the information in the document; and
determining if the generated at least one feature vector indicates that the document is likely to be product related.
(Dependent claims: 6, 7, 8)
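As a rough sketch of claim 5, the currency check and the follow-up feature test might look like the code below. The claim does not say which outcome maps to presence and which to absence; this example assumes that a currency identifier marks the document as of interest and that its absence triggers further examination. The currency pattern, the cue words, and the `min_hits` threshold are likewise hypothetical.

```python
import re
from typing import List, Sequence

# Assumed currency identifiers: a few symbols and ISO-style codes.
CURRENCY_RE = re.compile(r"[$€£¥]|\b(?:USD|EUR|GBP|JPY)\b")

# Hypothetical cue phrases whose counts make up the feature vector.
PRODUCT_CUES = ("buy", "cart", "shipping", "warranty", "in stock")


def has_currency_identifier(text: str) -> bool:
    return CURRENCY_RE.search(text) is not None


def feature_vector(text: str) -> List[int]:
    """One count per cue phrase, in the order of PRODUCT_CUES."""
    lowered = text.lower()
    return [lowered.count(cue) for cue in PRODUCT_CUES]


def is_product_related(vector: Sequence[int], min_hits: int = 2) -> bool:
    """Assumed rule: at least `min_hits` distinct cues must occur."""
    return sum(1 for v in vector if v > 0) >= min_hits


def classify(text: str) -> str:
    if has_currency_identifier(text):
        return "of interest"
    # No currency identifier: the document requires further examination.
    if is_product_related(feature_vector(text)):
        return "likely product related"
    return "unlikely product related"
```

In practice the final decision would more plausibly come from a trained model than from a fixed threshold; the threshold only keeps the sketch self-contained.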
9. At least one computer-readable medium having executable instructions stored thereon comprising:
a first classifier adapted to classify an initial set of source documents into one of at least three groups, a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined, the first classifier having input comprising at least one language dependent classification identifier used to make its classification decisions;
a feature extractor adapted to extract features from a document classified into the third group and to form at least one feature vector; and
a second classifier adapted to receive input comprising the at least one feature vector and language dependent model information and further adapted to classify the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector and the language dependent model information.
(Dependent claims: 10, 11, 12, 13, 14, 15, 16)
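One way to picture the language-dependent inputs in claim 9 is a per-language resource bundle that supplies identifiers to the first classifier and model information to the second. The sketch below is an assumption throughout: `LanguageResources`, the `RESOURCES` table, its example terms, and the weights are invented for illustration and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class LanguageResources:
    identifiers: Tuple[str, ...]  # language-dependent classification identifiers
    vocabulary: Tuple[str, ...]   # terms counted by the feature extractor
    weights: Tuple[float, ...]    # language-dependent model information
    bias: float


# Hypothetical resources for two languages.
RESOURCES: Dict[str, LanguageResources] = {
    "en": LanguageResources(("$", "usd"), ("price", "buy", "order"), (1.0, 1.0, 0.5), -1.2),
    "de": LanguageResources(("€", "eur"), ("preis", "kaufen", "bestellen"), (1.0, 1.0, 0.5), -1.2),
}


def identifiers_present(text: str, res: LanguageResources) -> bool:
    """Input to the first classifier: is any identifier for this language present?"""
    lowered = text.lower()
    return any(ident in lowered for ident in res.identifiers)


def extract_features(text: str, res: LanguageResources) -> List[float]:
    """Feature vector: one count per vocabulary term, in vocabulary order."""
    tokens = text.lower().split()
    return [float(tokens.count(term)) for term in res.vocabulary]


def second_classifier(vector: Sequence[float], res: LanguageResources) -> str:
    """Assumed linear model using the language's weights and bias."""
    score = sum(w * x for w, x in zip(res.weights, vector)) + res.bias
    return "first group" if score > 0 else "second group"
```

Keeping identifiers and model parameters together per language means the same classifier code can serve any language for which resources exist.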
17. A system comprising:
a first classifier adapted to classify an initial set of source documents into one of at least three groups, a first group containing documents of interest, a second group containing documents not of interest and a third group containing documents where the interest level is undetermined, through the following method:
determining the presence or absence of at least one classification identifier in a source document from the initial set of source documents;
if the at least one classification identifier is absent within the source document then classifying the source document into the second group;
if the at least one classification identifier is present within the source document then extracting a snippet from the document and determining the presence or absence of at least one keyword within the snippet;
if at least one keyword is present within the snippet, then classifying the source document into the first group, otherwise classifying the document into the third group;
a feature extractor adapted to extract features from the snippet associated with a document classified into the third group and to form at least one feature vector from the snippet; and
a second classifier adapted to receive input comprising the at least one feature vector and model information and further adapted to classify the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector and the model information.
(Dependent claims: 18, 19, 20)
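The rule cascade of claim 17 can be sketched as follows. The claim does not define how a snippet is cut from the document, so this example assumes a fixed character window around the first identifier found; the window radius, the snippet features, and the linear second classifier are also hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def extract_snippet(text: str, identifier: str, radius: int = 40) -> Optional[str]:
    """Assumed snippet: up to `radius` characters on either side of the identifier."""
    pos = text.find(identifier)
    if pos < 0:
        return None
    return text[max(0, pos - radius): pos + len(identifier) + radius]


def first_classifier(
    text: str, identifiers: Sequence[str], keywords: Sequence[str]
) -> Tuple[str, Optional[str]]:
    """Return (group, snippet) following the claimed cascade."""
    for ident in identifiers:
        snippet = extract_snippet(text, ident)
        if snippet is not None:
            lowered = snippet.lower()
            if any(k in lowered for k in keywords):
                return "first", snippet   # identifier and keyword present
            return "third", snippet       # identifier present, no keyword
    return "second", None                 # no identifier at all


def snippet_features(snippet: str) -> List[float]:
    """Hypothetical features: token count and digit count of the snippet."""
    return [float(len(snippet.split())), float(sum(c.isdigit() for c in snippet))]


def second_classifier(vector: Sequence[float], weights: Sequence[float], bias: float) -> str:
    """Assumed linear model; weights and bias stand in for the model information."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return "first" if score > 0 else "second"


def classify(text: str, identifiers: Sequence[str], keywords: Sequence[str],
             weights: Sequence[float], bias: float) -> str:
    group, snippet = first_classifier(text, identifiers, keywords)
    if group == "third" and snippet is not None:
        group = second_classifier(snippet_features(snippet), weights, bias)
    return group
```

Note that features are computed from the snippet rather than the whole document, which reflects the abstract's remark that some portions of the process use only part of a document.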
Specification