Automated data classification
First Claim
1. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a plurality of training tokens, each training token including a token retrieved from a content source and a classification of the token;
for each training token in the plurality of training tokens:
identifying, by the at least one server, a plurality of n-gram sequences,
generating, by the at least one server, a plurality of features for the plurality of n-gram sequences, and
generating, by the at least one server, first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
training a first classifier with the first training data;
storing, by the at least one server, the first classifier into a storage system in communication with the at least one server;
for each training token in the plurality of training tokens:
identifying a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identifying a second plurality of n-gram sequences, and
generating a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generating second training data using the second plurality of features;
training a second classifier with the second training data; and
storing, by the at least one server, the second classifier into the storage system in communication with the at least one server.
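The n-gram steps the claim recites can be illustrated with a short sketch. As a hedge: the claim specifies neither the n-gram granularity nor the feature representation; character-level trigrams, count-valued features, and the helper names `char_ngrams` and `ngram_features` below are all assumptions for illustration.

```python
def char_ngrams(token, n=3):
    """All contiguous character sequences of length n in the token."""
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]

def ngram_features(token, n=3):
    """Count each n-gram sequence: a simple bag-of-n-grams feature map."""
    feats = {}
    for gram in char_ngrams(token, n):
        feats[gram] = feats.get(gram, 0) + 1
    return feats

print(char_ngrams("phone"))       # ['pho', 'hon', 'one']
print(ngram_features("banana", 2))  # {'ba': 1, 'an': 2, 'na': 2}
```

Pairing such a feature map with the token's known classification yields one row of the "first training data" in the claim.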
Abstract
A system and method for data classification are presented. A plurality of training tokens are identified by at least one server communicatively coupled to a network. Each training token includes a token retrieved from a content source and a classification of the token. For each training token in the plurality of training tokens, a plurality of n-gram sequences are identified, a plurality of features for the plurality of n-gram sequences are generated, and first training data is generated using the token retrieved from the content source, the plurality of features, and the classification of the token. A first classifier is trained with the first training data, and the first classifier is stored into a storage system in communication with the at least one server.
18 Claims
1. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a plurality of training tokens, each training token including a token retrieved from a content source and a classification of the token;
for each training token in the plurality of training tokens:
identifying, by the at least one server, a plurality of n-gram sequences,
generating, by the at least one server, a plurality of features for the plurality of n-gram sequences, and
generating, by the at least one server, first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
training a first classifier with the first training data;
storing, by the at least one server, the first classifier into a storage system in communication with the at least one server;
for each training token in the plurality of training tokens:
identifying a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identifying a second plurality of n-gram sequences, and
generating a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generating second training data using the second plurality of features;
training a second classifier with the second training data; and
storing, by the at least one server, the second classifier into the storage system in communication with the at least one server.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
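The two-stage training the claim recites — train a first classifier on labelled tokens, then execute it on related tokens so that its probable classification becomes an extra feature for a second classifier — resembles stacked generalization. Below is a minimal sketch. The claim names no concrete model, so the Laplace-smoothed count model `CountClassifier`, the character 3-gram features, the `"prob="` feature naming, the sample tokens, and the assumption that the related tokens' classifications are known at training time are all illustrative choices, not the patented method.

```python
import math
from collections import defaultdict

def char_ngrams(token, n=3):
    """Character n-gram sequences of a token (the whole token if shorter than n)."""
    return [token[i:i + n] for i in range(max(1, len(token) - n + 1))]

class CountClassifier:
    """Laplace-smoothed count model standing in for the unspecified classifier."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # label -> feature -> count
        self.totals = defaultdict(int)                       # label -> total feature count

    def train(self, examples):
        """examples: iterable of (feature_list, classification) pairs."""
        for features, label in examples:
            for f in features:
                self.counts[label][f] += 1
                self.totals[label] += 1

    def classify(self, features):
        """Return the most probable classification for a feature list."""
        def log_score(label):
            denom = self.totals[label] + len(self.counts[label]) + 1
            return sum(math.log((self.counts[label].get(f, 0) + 1) / denom)
                       for f in features)
        return max(self.counts, key=log_score)

# Stage 1: first training data pairs each token's n-gram features with its classification.
training_tokens = [("555-1234", "phone"), ("867-5309", "phone"),
                   ("a@b.com", "email"), ("c@d.org", "email")]
first = CountClassifier()
first.train([(char_ngrams(tok), label) for tok, label in training_tokens])

# Stage 2: each related token's second feature set is its n-grams plus the first
# classifier's probable classification; the second classifier trains on that.
related_tokens = [("555-9999", "phone"), ("x@y.net", "email")]  # labels assumed known
second = CountClassifier()
second.train([(char_ngrams(tok) + ["prob=" + first.classify(char_ngrams(tok))], label)
              for tok, label in related_tokens])
```

At inference time an unseen token would be classified the same way it was featurized: run `first.classify` on its n-grams, append the resulting `"prob="` feature, and hand the enriched feature list to `second.classify`.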
9. A method, comprising:
identifying, by at least one server communicatively coupled to a network, a training token including a token retrieved from a content source and a classification of the token;
generating, by the at least one server, features for the training token;
training, by the at least one server, a classifier using the token retrieved from the content source, the features for the training token, and the classification; and
storing, by the at least one server, the classifier into a storage system in communication with the at least one server;
identifying, by the at least one server, a related token;
identifying second features for the related token by executing the classifier on the related token to generate a probable classification of the related token;
training, by the at least one server, a second classifier using the related token and the second features; and
storing, by the at least one server, the second classifier into a storage system in communication with the at least one server.
- View Dependent Claims (10, 11, 12, 13, 14)
15. A system, comprising:
a server computer configured to communicate with a content source using a network, the server computer being configured to:
identify a plurality of training tokens, each training token including a token retrieved from the content source and a classification of the token;
for each training token in the plurality of training tokens:
identify a plurality of n-gram sequences,
generate a plurality of features for the plurality of n-gram sequences, and
generate first training data using the token retrieved from the content source, the plurality of features, and the classification of the token;
train a first classifier with the first training data;
store the first classifier into a storage system in communication with the server computer;
for each training token in the plurality of training tokens:
identify a plurality of related tokens in the content source,
for each of the related tokens in the content source:
identify a second plurality of n-gram sequences, and
generate a second plurality of features using the second plurality of n-gram sequences and by executing the first classifier on the related token to generate a probable classification of the related token;
generate second training data using the second plurality of features;
train a second classifier with the second training data; and
store the second classifier into the storage system in communication with the server computer.
- View Dependent Claims (16, 17, 18)
Specification