Information classification paradigm
First Claim
1. A machine-implemented method for determining whether documents being searched are relevant to a desired piece of information, the method comprising:
- accessing documents stored in electronic form in a memory;
- automatically classifying, by a processor, documents in an initial set of source documents of the electronic documents into one of at least three groups, where the classifying is performed by a first classifier comprising an untrained rules-based classifier applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of source documents, the first classifier comprising software executing on a machine, the at least three groups comprising a first group containing documents determined by the untrained rules-based classifier to be of interest to the desired piece of information, a second group containing documents determined by the untrained rules-based classifier to be not of interest to the desired piece of information, and a third group containing documents that the untrained rules-based classifier did not place in the first group or the second group, the first classifier classifying by the processor:
  - determining the presence or absence of at least one classification identifier in a source document;
  - if the at least one classification identifier is absent within the source document, then classifying the source document into the second group;
  - if the at least one classification identifier is present within the source document, then extracting, with a snippet extractor, a snippet from the document and determining the presence or absence of at least one keyword within the snippet, wherein the snippet is selected based on structures of the source document; and
  - if at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the source document into the third group; and
- for each document classified into the third group, the processor:
  - extracting from a document at least one feature vector; and
  - classifying the document with a second classifier into either the first group or the second group based on the at least one feature vector, wherein the second classifier comprises a support vector machine (SVM) trained, prior to the classifying, with a plurality of labeled training documents, the labeled training documents having been labeled according to analysis of the documents prior to performance of the method, and where the second classifier comprises software executing on the machine.
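The first-classifier steps of claim 1 (identifier check, snippet extraction, keyword check) can be sketched as a small rules-based triage. This is only an illustrative sketch, not the patented implementation: the names `Group`, `extract_snippet`, `triage`, `PHONE` and `KEYWORDS` are hypothetical, the phone-number pattern and keyword list are made-up rules, and blank-line-separated paragraphs are assumed to stand in for "structures of the source document".

```python
import re
from enum import Enum


class Group(Enum):
    OF_INTEREST = 1      # first group
    NOT_OF_INTEREST = 2  # second group
    UNDECIDED = 3        # third group, handed to the second classifier


def extract_snippet(text, identifier_pattern):
    """Return the paragraph containing the first identifier match, or None.

    Assumption: paragraphs separated by blank lines play the role of the
    document structure used to select the snippet.
    """
    for paragraph in re.split(r"\n\s*\n", text):
        if identifier_pattern.search(paragraph):
            return paragraph
    return None


def triage(text, identifier_pattern, keywords):
    """Untrained rules-based first pass following the claimed decision steps."""
    snippet = extract_snippet(text, identifier_pattern)
    if snippet is None:
        # Classification identifier absent: second group.
        return Group.NOT_OF_INTEREST, None
    lowered = snippet.lower()
    if any(keyword in lowered for keyword in keywords):
        # Identifier present and a keyword found in the snippet: first group.
        return Group.OF_INTEREST, snippet
    # Identifier present but no keyword in the snippet: third group.
    return Group.UNDECIDED, snippet


# Hypothetical rule set: phone-number-like identifiers and contact keywords.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
KEYWORDS = ("phone", "call", "tel")

docs = {
    "a": "Welcome to our store.\n\nOpening hours are 9 to 5.",
    "b": "About us.\n\nCall 555-010-4477 for reservations.",
    "c": "About us.\n\nOrder ref 555-010-4477 was shipped.",
}
groups = {name: triage(text, PHONE, KEYWORDS)[0] for name, text in docs.items()}
```

Under these assumed rules, document `a` has no identifier, `b` has an identifier with a keyword in its snippet, and `c` has an identifier but no keyword, so only `c` would go on to the second classifier.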
Abstract
A mechanism to classify source documents into one of two categories: likely to contain desired information or unlikely to contain it. Generally, some form of rules-based classification is used in conjunction with deeper analysis that applies advanced techniques to difficult cases. The rules-based classification is generally good at eliminating cases from further consideration and at identifying documents of interest based on generally discernible relationships between data or on the presence or absence of data. The deeper analysis is used to uncover more complex relationships between data that may identify documents of interest. Some portions of the process may use the entire document, while other portions may use only a portion of it.
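The overall shape described in the abstract, a cheap rules pass that settles the easy cases and a deeper analysis reserved for the difficult ones, might look like the following sketch. The function names `cascade`, `toy_rules` and `toy_deep` and their logic are hypothetical stand-ins chosen for illustration; in particular, `toy_deep` is a placeholder for a trained model, not a real one.

```python
from typing import Callable, Iterable, Optional


def cascade(documents: Iterable[str],
            rules: Callable[[str], Optional[bool]],
            deep: Callable[[str], bool]) -> dict:
    """Two-way classification with a cheap first pass and a costly fallback.

    `rules` returns True (likely), False (unlikely) or None (cannot tell);
    `deep` is consulted only for the None cases.
    """
    result = {"likely": [], "unlikely": [], "deep_calls": 0}
    for doc in documents:
        verdict = rules(doc)
        if verdict is None:
            result["deep_calls"] += 1
            verdict = deep(doc)
        result["likely" if verdict else "unlikely"].append(doc)
    return result


# Hypothetical stand-ins for the two stages.
def toy_rules(doc):
    if "@" not in doc:
        return False                 # no identifier: eliminate from consideration
    if "contact" in doc.lower():
        return True                  # obvious relationship: accept
    return None                      # difficult case, defer to deeper analysis


def toy_deep(doc):
    return len(doc.split()) < 6      # placeholder for a trained model


docs = [
    "no address here",
    "Contact: ann@example.com",
    "ann@example.com wrote a long message about something else",
]
out = cascade(docs, toy_rules, toy_deep)
```

Here only the third document reaches `toy_deep`, which is the point of the design: the expensive analysis runs on the residue the rules could not decide.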
13 Claims
1. A machine-implemented method for determining whether documents being searched are relevant to a desired piece of information (full text given under First Claim above). Dependent claims: 2, 3, 4.
5. A storage memory having executable instructions stored thereon that, when executed by a processor, carry out a method comprising:
- a first classifier comprising an untrained rules-based classifier classifying an initial set of electronic source documents into one of at least three groups by applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of electronic documents, a first group containing documents determined by the first classifier to have a sufficient probability of being of interest for a given piece of information, a second group containing documents determined by the first classifier to have sufficient probability of being not of interest for the given piece of information, and a third group containing documents analyzed by the first classifier but not determined by the first classifier to have sufficient probability of being of interest or not of interest for the given piece of information, the first classifier having input comprising at least one classification identifier used to make its classification determinations, the first classifier classifying by:
  - determining the presence or absence of the at least one classification identifier in a source document;
  - if the at least one classification identifier is absent within the source document, then classifying the source document into the second group;
  - if the at least one classification identifier is present within the source document, then extracting, with a snippet extractor, a snippet from the document and determining the presence or absence of at least one keyword within the snippet, wherein the snippet is extracted based on structures of the source document; and
  - if at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the document into the third group;
- a feature extractor extracting features from a document classified into the third group to form at least one feature vector corresponding to the document; and
- a second classifier receiving input comprising the at least one feature vector and further classifying the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector, the second classifier comprising a learning to classify documents by being trained with labeled training documents prior to performing classification.
Dependent claims: 6, 7, 8, 9.
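Claim 5 recites a feature extractor that turns a third-group document into at least one feature vector, without fixing which features are used. As a minimal sketch under that gap, the example below assumes a hashed bag-of-words representation; the function name `feature_vector`, the dimension `dim` and the tokenization are all illustrative assumptions rather than anything stated in the claims.

```python
import re
import zlib


def feature_vector(text, dim=16):
    """Map text to a fixed-length vector of normalized hashed token counts.

    Assumption: hashed bag-of-words; the claim only requires that the
    extracted features form at least one feature vector.
    """
    vector = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % dim
        vector[index] += 1.0
    total = sum(vector)
    # Normalize so that long and short documents are comparable.
    return [v / total for v in vector] if total else vector


vec = feature_vector("Order ref 555-010-4477 was shipped")
```

A fixed length matters because the second classifier expects every document to arrive as a vector of the same dimension, regardless of how long the document or snippet is.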
10. A computing device for determining whether electronic documents being searched are relevant to a given piece of information, the device comprising:
- a first classifier in a memory of the computing device, the first classifier comprising an untrained rules-based classifier classifying documents in an initial set of source documents of the electronic documents into at least three groups by applying rules to text of the documents to determine presence and absence of classification identifiers in the initial set of source documents, a first group containing documents determined by the first classifier to be of interest to the desired piece of information, a second group containing documents determined by the first classifier not to be of interest to the desired piece of information, and a third group containing documents that the first classifier did not classify into the first group or second group, the first classifier using a processor of the computing device to classify by:
  - applying rules describing text content to determine the presence or absence of at least one classification identifier in a source document from the initial set of source documents;
  - if the at least one classification identifier is absent from the source document, then classifying the source document into the second group;
  - if the at least one classification identifier is present within the source document, then extracting a snippet from the document containing the classification identifier and determining the presence or absence of at least one keyword within the snippet;
  - if the at least one keyword is present within the snippet, then classifying the source document into the first group, and otherwise classifying the document into the third group;
- a feature extractor, in the memory of the computing device, extracting, with the processor, features from the snippet associated with the document classified into the third group to form at least one feature vector from the snippet; and
- a second classifier, in the memory of the computing device, receiving input comprising the at least one feature vector and further classifying the document associated with the at least one feature vector into either the first group or the second group based on the at least one feature vector, wherein the second classifier learns to classify documents by being trained with labeled training documents prior to the classifying such that classification outcomes for the inputted feature vector depend on the training documents.
Dependent claims: 11, 12, 13.
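The second classifier in claim 10 is trained on labeled documents before any classification takes place, and claim 1 names a support vector machine for that role. The claims do not specify a kernel or a training procedure, so the sketch below makes its own assumptions: a linear SVM without a bias term, trained by Pegasos-style stochastic subgradient descent on toy two-dimensional vectors that are assumed to be centered. The names `train_linear_svm` and `predict` and all data values are hypothetical.

```python
import random


def train_linear_svm(samples, labels, lam=0.1, epochs=50, seed=0):
    """Train a linear SVM (no bias) by stochastic subgradient descent.

    samples: equal-length feature vectors; labels: +1 (first group) or
    -1 (second group). Assumption: Pegasos-style step size 1 / (lam * t).
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # L2 regularization shrink: 1 - eta * lam equals 1 - 1/t.
            w = [wj * (1.0 - 1.0 / t) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: step toward the violated sample.
                eta = 1.0 / (lam * t)
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 (first group) or -1 (second group)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0.0 else -1


# Hypothetical labeled training vectors, prepared before classification.
train_x = [[1.0, -0.5], [0.8, -1.0], [-1.0, 0.6], [-0.7, 1.0]]
train_y = [1, 1, -1, -1]
weights = train_linear_svm(train_x, train_y)

# Classify a feature vector from a hypothetical third-group document.
label = predict(weights, [0.9, -0.4])
```

Training happens once, up front; at classification time only `predict` runs, so the outcome for any input vector depends entirely on the labeled training documents, as the claim language requires.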
Specification