SYSTEM AND METHOD FOR AUTOMATIC DOCUMENT CLASSIFICATION IN EDISCOVERY, COMPLIANCE AND LEGACY INFORMATION CLEAN-UP
Abstract
A system, method and computer program product for automatic document classification, including an extraction module configured to extract structural, syntactical and/or semantic information from a document and normalize the extracted information; a machine learning module configured to generate a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and a classification module configured to select a non-classified document from a document collection, and via the extraction module extract normalized structural, syntactical and/or semantic information from the selected document, and generate via the machine learning module a model representation of the selected document based on feature vectors, and match the model representation of the selected document against the machine learning model representation to generate a document category, and/or classification for display to a user.
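As a rough illustration of the pipeline the abstract describes, the sketch below walks through the three modules in plain Python. Everything in it is an assumption made for illustration, not the patented implementation: the function names (`extract_and_normalize`, `to_feature_vector`, `train_model`, `classify`), the bag-of-words features, and the per-category summed vector matched by cosine similarity stand in for whatever extraction and learning techniques the specification actually uses.

```python
import math
import re
from collections import Counter, defaultdict


def extract_and_normalize(text):
    """Hypothetical extraction module: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def to_feature_vector(tokens):
    """Build a sparse term-frequency feature vector from normalized tokens."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_model(labeled_docs):
    """Hypothetical machine learning module: one summed term vector per category."""
    model = defaultdict(Counter)
    for text, category in labeled_docs:
        model[category].update(to_feature_vector(extract_and_normalize(text)))
    return dict(model)


def classify(text, model):
    """Hypothetical classification module: pick the best-matching category."""
    vector = to_feature_vector(extract_and_normalize(text))
    return max(model, key=lambda cat: cosine(vector, model[cat]))


# Toy training data, invented for this example only.
training = [
    ("Invoice for services rendered, payment due in 30 days", "finance"),
    ("Quarterly payment schedule and invoice totals", "finance"),
    ("The parties agree to the terms of this contract", "legal"),
    ("This agreement binds the parties under contract law", "legal"),
]
model = train_model(training)
category = classify("Please review the attached contract between the parties", model)
print(category)  # legal
```

The same `extract_and_normalize` and `to_feature_vector` steps are applied both when building the model and when classifying a new document, mirroring the abstract's requirement that the selected document pass through the same extraction module before matching.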
15 Claims
1. A computer implemented system for automatic document classification, the system comprising:
an extraction module configured to extract structural, syntactical and/or semantic information from a document and normalize the extracted information;
a machine learning module configured to generate a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and
a classification module configured to select a non-classified document from a document collection, and via the extraction module extract normalized structural, syntactical and/or semantic information from the selected document, and generate via the machine learning module a model representation of the selected document based on feature vectors, and match the model representation of the selected document against the machine learning model representation to generate a document category, and/or classification for display to a user.
Dependent claims: 2, 3, 4, 5
6. A computer implemented method for automatic document classification, the method comprising:
extracting with an extraction module structural, syntactical and/or semantic information from a document and normalizing with the extraction module the extracted information;
generating with a machine learning module a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and
selecting with a classification module a non-classified document from a document collection, and extracting via the extraction module normalized structural, syntactical and/or semantic information from the selected document, and generating via the machine learning module a model representation of the selected document based on feature vectors, and matching with the classification module the model representation of the selected document against the machine learning model representation and generating with the classification module a document category, and/or classification for display to a user.
Dependent claims: 7, 8, 9, 10
11. A computer program product for automatic document classification and including one or more computer readable instructions embedded on a tangible, non-transitory computer readable medium and configured to cause one or more computer processors to perform the steps of:
extracting with an extraction module structural, syntactical and/or semantic information from a document and normalizing with the extraction module the extracted information;
generating with a machine learning module a model representation for automatic document classification based on feature vectors built from the normalized and extracted semantic information for supervised and/or unsupervised clustering or machine learning; and
selecting with a classification module a non-classified document from a document collection, and extracting via the extraction module normalized structural, syntactical and/or semantic information from the selected document, and generating via the machine learning module a model representation of the selected document based on feature vectors, and matching with the classification module the model representation of the selected document against the machine learning model representation and generating with the classification module a document category, and/or classification for display to a user.
Dependent claims: 12, 13, 14, 15
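The independent claims allow the model to be built by unsupervised clustering as well as by supervised learning. A minimal sketch of that unsupervised variant follows, assuming term-count feature vectors and a simple single-pass ("leader") clustering with a cosine-similarity threshold. The names `features` and `leader_cluster`, the threshold of 0.3, and the choice of clustering technique are all assumptions made for brevity; the claims do not name a specific algorithm.

```python
import math
import re
from collections import Counter


def features(text):
    """Normalize (lowercase, tokenize) and count terms; an assumed feature scheme."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_cluster(docs, threshold=0.3):
    """Assign each document to the most similar existing cluster, or start a new one."""
    centroids = []  # one summed term vector per cluster
    labels = []
    for text in docs:
        vec = features(text)
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vec)  # fold the document into the cluster
            labels.append(best)
        else:
            centroids.append(Counter(vec))  # no close match: open a new cluster
            labels.append(len(centroids) - 1)
    return labels


# Toy documents, invented for this example only.
docs = [
    "invoice payment due",
    "contract between the parties",
    "payment invoice overdue",
    "parties sign the contract",
]
labels = leader_cluster(docs)
print(labels)  # [0, 1, 0, 1]
```

The resulting cluster indices are unlabeled groups; turning them into named document categories would still need a reviewer or a labeled sample, which this sketch leaves out.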
Specification