Automated data classification
First Claim
1. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a plurality of training tokens, each training token including a token retrieved from a content source and a classification of the token;
for each training token in the plurality of training tokens:
identifying, by the at least one server, a plurality of n-gram sequences,
generating, by the at least one server, a plurality of features for the plurality of n-gram sequences, and
generating, by the at least one server, first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
training a first classifier with the first training data;
storing, by the at least one server, the first classifier into a storage system in communication with the at least one server;
for each training token in the plurality of training tokens:
identifying a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identifying a second plurality of n-gram sequences, and
generating a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generating second training data using the second plurality of features;
training a second classifier with the second training data; and
storing, by the at least one server, the second classifier into the storage system in communication with the at least one server.
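The n-gram steps the claim recites can be illustrated with a short sketch. As a hedge: the claim specifies neither the n-gram granularity nor the feature representation; character-level trigrams, count-valued features, and the helper names `char_ngrams` and `ngram_features` below are all assumptions for illustration.

```python
def char_ngrams(token, n=3):
    """All contiguous character sequences of length n in the token."""
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def ngram_features(token, n=3):
    """Count each n-gram sequence: a simple bag-of-n-grams feature map."""
    feats = {}
    for gram in char_ngrams(token, n):
        feats[gram] = feats.get(gram, 0) + 1
    return feats

print(char_ngrams("phone"))       # ['pho', 'hon', 'one']
print(ngram_features("banana", 2))  # {'ba': 1, 'an': 2, 'na': 2}
```

Pairing such a feature map with the token's known classification yields one row of the "first training data" in the claim.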
Abstract
A system and method for data classification are presented. A plurality of training tokens are identified by at least one server communicatively coupled to a network. Each training token includes a token retrieved from a content source and a classification of the token. For each training token in the plurality of training tokens, a plurality of n-gram sequences are identified, a plurality of features for the plurality of n-gram sequences are generated, and first training data is generated using the token retrieved from the content source, the plurality of features, and the classification of the token. A first classifier is trained with the first training data, and the first classifier is stored into a storage system in communication with the at least one server.
18 Claims
1. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a plurality of training tokens, each training token including a token retrieved from a content source and a classification of the token;
for each training token in the plurality of training tokens:
identifying, by the at least one server, a plurality of n-gram sequences,
generating, by the at least one server, a plurality of features for the plurality of n-gram sequences, and
generating, by the at least one server, first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
training a first classifier with the first training data;
storing, by the at least one server, the first classifier into a storage system in communication with the at least one server;
for each training token in the plurality of training tokens:
identifying a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identifying a second plurality of n-gram sequences, and
generating a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generating second training data using the second plurality of features;
training a second classifier with the second training data; and
storing, by the at least one server, the second classifier into the storage system in communication with the at least one server.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
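The two-stage training the claim recites — train a first classifier on labelled tokens, then execute it on related tokens so that its probable classification becomes an extra feature for a second classifier — resembles stacked generalization. Below is a minimal sketch. The claim names no concrete model, so the Laplace-smoothed count model `CountClassifier`, the character 3-gram features, the `"prob="` feature naming, the sample tokens, and the assumption that the related tokens' classifications are known at training time are all illustrative choices, not the patented method.

```python
import math
from collections import defaultdict

def char_ngrams(token, n=3):
    """Character n-gram sequences of a token (the whole token if shorter than n)."""
    return [token[i:i + n] for i in range(max(1, len(token) - n + 1))]

class CountClassifier:
    """Laplace-smoothed count model standing in for the unspecified classifier."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # label -> feature -> count
        self.totals = defaultdict(int)                       # label -> total feature count

    def train(self, examples):
        """examples: iterable of (feature_list, classification) pairs."""
        for features, label in examples:
            for f in features:
                self.counts[label][f] += 1
                self.totals[label] += 1

    def classify(self, features):
        """Return the most probable classification for a feature list."""
        def log_score(label):
            denom = self.totals[label] + len(self.counts[label]) + 1
            return sum(math.log((self.counts[label].get(f, 0) + 1) / denom)
                       for f in features)
        return max(self.counts, key=log_score)

# Stage 1: first training data pairs each token's n-gram features with its classification.
training_tokens = [("555-1234", "phone"), ("867-5309", "phone"),
                   ("a@b.com", "email"), ("c@d.org", "email")]
first = CountClassifier()
first.train([(char_ngrams(tok), label) for tok, label in training_tokens])

# Stage 2: each related token's second feature set is its n-grams plus the first
# classifier's probable classification; the second classifier trains on that.
related_tokens = [("555-9999", "phone"), ("x@y.net", "email")]  # labels assumed known
second = CountClassifier()
second.train([(char_ngrams(tok) + ["prob=" + first.classify(char_ngrams(tok))], label)
              for tok, label in related_tokens])
```

At inference time an unseen token would be classified the same way it was featurized: run `first.classify` on its n-grams, append the resulting `"prob="` feature, and hand the enriched feature list to `second.classify`.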
9. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a training token including a token retrieved from a content source and a classification of the token;
generating, by the at least one server, features for the training token;
training, by the at least one server, a classifier using the token retrieved from the content source, the features for the training token, and the classification; and
storing, by the at least one server, the classifier into a storage system in communication with the at least one server;
identifying, by the at least one server, a related token;
identifying second features for the related token by executing the classifier on the related token to generate a probable classification of the related token;
training, by the at least one server, a second classifier using the related token and the second features; and
storing, by the at least one server, the second classifier into a storage system in communication with the at least one server.
- View Dependent Claims (10, 11, 12, 13, 14)
15. A system, comprising:
a server computer configured to communicate with a content source using a network, the server computer being configured to:
identify a plurality of training tokens, each training token including a token retrieved from the content source and a classification of the token;
for each training token in the plurality of training tokens:
identify a plurality of n-gram sequences,
generate a plurality of features for the plurality of n-gram sequences, and
generate first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
train a first classifier with the first training data;
store the first classifier into a storage system in communication with the server computer;
for each training token in the plurality of training tokens:
identify a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identify a second plurality of n-gram sequences, and
generate a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generate second training data using the second plurality of features;
train a second classifier with the second training data; and
store the second classifier into the storage system in communication with the server computer.
- View Dependent Claims (16, 17, 18)
Specification