Data classification using machine learning techniques
First Claim
1. An article of manufacture comprising:
a non-transitory program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by a computer to perform a method of data classification, the one or more programs of instructions comprising;
instructions for receiving at least one labeled seed document;
instructions for receiving unlabeled documents;
instructions for receiving at least one predetermined cost factor;
instructions for training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents;
instructions for classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier;
instructions for reclassifying at least some of the categorized documents previously categorized by a different classifier into the categories using the classifier; and
instructions for outputting identifiers of the categorized documents to at least one of a user, another system, and another process.
Abstract
A system and article of manufacture enabling adapting to a shift in document content according to one embodiment of the present invention includes instructions for: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier; reclassifying at least some of the categorized documents into the categories using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process. Systems and articles of manufacture for separating documents are also presented. Systems and articles of manufacture for document searching are also presented.
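The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: a self-training nearest-centroid model stands in for the transductive classifier, the cost factor is assumed to be the weight given to pseudo-labeled unlabeled documents, and the confidence level is assumed to be the margin between the two best category scores. All function names, documents, and parameter values are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into an L2-normalised bag-of-words vector (a dict)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse vectors (cosine when both are unit length)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def centroid(weighted_vectors):
    """Weighted, re-normalised sum of (vector, weight) pairs."""
    total = Counter()
    for vec, weight in weighted_vectors:
        for term, w in vec.items():
            total[term] += weight * w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {term: w / norm for term, w in total.items()}


def best_label(vec, centroids):
    """Return (label, confidence); confidence is the top-two score margin."""
    ranked = sorted(((cosine(vec, c), lab) for lab, c in centroids.items()),
                    reverse=True)
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], ranked[0][0] - runner_up


def train_transductive(seeds, unlabeled, cost_factor, rounds=5):
    """Fit centroids on seeds plus pseudo-labeled unlabeled documents.

    Assumption: cost_factor down-weights the unlabeled documents.
    """
    seed_vecs = [(vectorize(text), label) for text, label in seeds]
    unlabeled_vecs = [vectorize(text) for text in unlabeled.values()]
    labels = {label for _, label in seed_vecs}

    def fit(pseudo):
        return {lab: centroid(
                    [(v, 1.0) for v, l in seed_vecs if l == lab]
                    + [(v, cost_factor) for v, l in pseudo if l == lab])
                for lab in labels}

    centroids = fit([])
    for _ in range(rounds):
        pseudo = []
        for vec in unlabeled_vecs:
            label, margin = best_label(vec, centroids)
            if margin > 0:  # skip documents the model cannot tell apart
                pseudo.append((vec, label))
        centroids = fit(pseudo)
    return centroids


def classify(centroids, documents, threshold):
    """Return {doc_id: label} only for documents above the threshold."""
    result = {}
    for doc_id, text in documents.items():
        label, confidence = best_label(vectorize(text), centroids)
        if confidence > threshold:
            result[doc_id] = label
    return result


seeds = [("invoice total amount due payment", "invoice"),
         ("dear sir regards letter sincerely", "letter")]
unlabeled = {"d1": "invoice amount payment overdue",
             "d2": "letter regards dear friend",
             "d3": "weather forecast sunny"}
centroids = train_transductive(seeds, unlabeled, cost_factor=0.5)
categorized = classify(centroids, unlabeled, threshold=0.1)
# Reclassify a document that an earlier classifier had labeled differently.
previous = {"d1": "letter"}
reclassified = classify(centroids, {i: unlabeled[i] for i in previous}, 0.1)
print(sorted(categorized.items()), reclassified)
```

In this toy run the document with no vocabulary overlap (`d3`) stays uncategorized because its confidence does not exceed the threshold, which mirrors the "confidence level above a predefined threshold" condition.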
28 Claims
1. An article of manufacture comprising:
a non-transitory program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by a computer to perform a method of data classification, the one or more programs of instructions comprising;
instructions for receiving at least one labeled seed document;
instructions for receiving unlabeled documents;
instructions for receiving at least one predetermined cost factor;
instructions for training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents;
instructions for classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier;
instructions for reclassifying at least some of the categorized documents previously categorized by a different classifier into the categories using the classifier; and
instructions for outputting identifiers of the categorized documents to at least one of a user, another system, and another process.
Dependent claims: 2, 3, 4, 5, 6
7. An article of manufacture comprising:
a non-transitory program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by a computer to perform a method for separating documents, the one or more programs of instructions comprising;
instructions for receiving labeled data;
instructions for receiving a sequence of unlabeled documents;
instructions for adapting probabilistic classification rules using transduction based on the labeled data and the unlabeled documents;
instructions for updating weights used for document separation according to the probabilistic classification rules;
instructions for determining locations of separations in the sequence of documents;
instructions for outputting indicators of the determined locations of the separations in the sequence to at least one of a user, another system, and another process; and
instructions for flagging the documents with codes, the codes correlating to the indicators.
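The separation steps recited in claim 7 can be pictured with a rough Python sketch. It makes several assumptions that the claim does not state: the probabilistic classification rules are modeled as a two-class naive Bayes model ("starts a new document" versus "continues one"), transduction is approximated by pseudo-labeling the unlabeled sequence and refitting, and the flag codes `SEP` and `CONT` are invented for illustration. All names and sample pages are hypothetical.

```python
import math
from collections import Counter


def fit(examples):
    """Estimate weighted naive Bayes counts from (tokens, is_start, weight)."""
    counts = {True: Counter(), False: Counter()}
    prior = {True: 1.0, False: 1.0}  # add-one smoothing on the class prior
    for tokens, is_start, weight in examples:
        prior[is_start] += weight
        for token in tokens:
            counts[is_start][token] += weight
    return counts, prior


def p_start(tokens, model):
    """Probability that a page starts a new document under the model."""
    counts, prior = model
    vocab = len(set(counts[True]) | set(counts[False]) | set(tokens))
    totals = {k: sum(counts[k].values()) for k in (True, False)}
    log_odds = math.log(prior[True] / prior[False])
    for token in tokens:
        p_true = (counts[True][token] + 1) / (totals[True] + vocab)
        p_false = (counts[False][token] + 1) / (totals[False] + vocab)
        log_odds += math.log(p_true / p_false)
    return 1.0 / (1.0 + math.exp(-log_odds))


def separate(labeled, sequence, unlabeled_weight=0.5, rounds=3):
    """Adapt the model to the sequence, then locate and flag separations."""
    base = [(text.lower().split(), is_start, 1.0) for text, is_start in labeled]
    pages = [text.lower().split() for text in sequence]
    model = fit(base)
    for _ in range(rounds):
        # Pseudo-label the unlabeled pages and update the weights.
        pseudo = [(tokens, p_start(tokens, model) > 0.5, unlabeled_weight)
                  for tokens in pages]
        model = fit(base + pseudo)
    starts = [i for i, tokens in enumerate(pages)
              if p_start(tokens, model) > 0.5]
    codes = ["SEP" if i in starts else "CONT" for i in range(len(pages))]
    return starts, codes


labeled = [("invoice number date", True),
           ("page 2 continued", False),
           ("invoice number total", True),
           ("continued items page 3", False)]
sequence = ["invoice number 1001", "continued page 2", "page 3 continued",
            "invoice number 1002", "continued page 2"]
starts, codes = separate(labeled, sequence)
print(starts, codes)
```

Here `starts` plays the role of the indicators of separation locations, and `codes` flags each page so that the codes correlate to those indicators.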
8. An article of manufacture comprising:
a non-transitory program storage medium readable by a computer, where the medium tangibly embodies one or more programs of instructions executable by a computer to perform a method of document searching, the one or more programs of instructions comprising;
instructions for receiving a search query;
instructions for retrieving documents based on the search query;
instructions for outputting the documents;
instructions for receiving user-entered labels for at least some of the documents, the labels being indicative of a relevance of the document to the search query;
instructions for training a classifier based on the search query and the user-entered labels;
instructions for performing a document classification technique on the documents using the classifier for reclassifying the documents; and
instructions for outputting identifiers of at least some of the documents based on the classification thereof.
Dependent claims: 9, 10, 11, 12, 13, 14
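The search flow recited in claim 8 resembles relevance feedback, and a small Python sketch may help make it concrete. The classifier here is an assumption: a Rocchio-style nearest-prototype model in which the query and the documents labeled relevant form one prototype and the documents labeled not relevant form the other. The claim does not specify this technique, and the corpus, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * b[t] for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, corpus):
    """Return ids of documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items()
            if terms & set(text.lower().split())]


def train(query, labels, corpus):
    """Build relevant and non-relevant prototypes from query and labels."""
    relevant, nonrelevant = bag(query), Counter()
    for doc_id, is_relevant in labels.items():
        target = relevant if is_relevant else nonrelevant
        target.update(bag(corpus[doc_id]))
    return relevant, nonrelevant


def reclassify(doc_ids, corpus, model):
    """Keep documents that sit closer to the relevant prototype."""
    relevant, nonrelevant = model
    return [doc_id for doc_id in doc_ids
            if cosine(bag(corpus[doc_id]), relevant)
            > cosine(bag(corpus[doc_id]), nonrelevant)]


corpus = {"a": "jaguar top speed in the wild",
          "b": "jaguar car top speed and engine",
          "c": "jaguar habitat diet",
          "d": "jaguar car dealership prices",
          "e": "cheetah speed record"}
query = "jaguar speed"
results = retrieve(query, corpus)        # documents output to the user
user_labels = {"a": True, "b": False}    # user-entered relevance labels
model = train(query, user_labels, corpus)
relevant_ids = reclassify(results, corpus, model)
print(results, relevant_ids)
```

With two labels the model separates the animal sense of the query from the car sense, so only identifiers of documents on the relevant side are output.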
15. A system comprising:
a processor; and
a program storage medium and/or memory storing one or more programs of instructions executable by the processor to perform a method of data classification, the one or more programs of instructions comprising;
instructions for receiving at least one labeled seed document;
instructions for receiving unlabeled documents;
instructions for receiving at least one predetermined cost factor;
instructions for training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents;
instructions for classifying the unlabeled documents having a confidence level above a predefined threshold into a plurality of categories using the classifier;
instructions for reclassifying at least some of the categorized documents previously categorized by a different classifier into the categories using the classifier; and
instructions for outputting identifiers of the categorized documents to at least one of a user, another system, and another process.
Dependent claims: 16, 17, 18, 19, 20
21. A system comprising:
a processor; and
a program storage medium and/or memory storing one or more programs of instructions executable by the processor to perform a method for separating documents, the one or more programs of instructions comprising;
instructions for receiving labeled data;
instructions for receiving a sequence of unlabeled documents;
instructions for adapting probabilistic classification rules using transduction based on the labeled data and the unlabeled documents;
instructions for updating weights used for document separation according to the probabilistic classification rules;
instructions for determining locations of separations in the sequence of documents;
instructions for outputting indicators of the determined locations of the separations in the sequence to at least one of a user, another system, and another process; and
instructions for flagging the documents with codes, the codes correlating to the indicators.
22. A system comprising:
a processor; and
a program storage medium and/or memory storing one or more programs of instructions executable by the processor to perform a method of document searching, the one or more programs of instructions comprising;
instructions for receiving a search query;
instructions for retrieving documents based on the search query;
instructions for outputting the documents;
instructions for receiving user-entered labels for at least some of the documents, the labels being indicative of a relevance of the document to the search query;
instructions for training a classifier based on the search query and the user-entered labels;
instructions for performing a document classification technique on at least some of the documents using the classifier for reclassifying the at least some of the documents; and
instructions for outputting identifiers of the at least some of the documents based on the classification thereof.
Dependent claims: 23, 24, 25, 26, 27, 28
Specification