Identifying and processing a number of features identified in a document to determine a type of the document
First Claim
1. A method, comprising:
receiving, by at least one server communicatively coupled to a network, an input document;
identifying, by the at least one server, a plurality of features in the input document, the plurality of features including sequences of text extracted from the input document;
generating, by the at least one server, a feature vector of the input document based upon the sequences of text;
identifying, by the at least one server, a plurality of signature vectors based upon an input training dataset and at least one cross-type frequency vector;
comparing, by the at least one server, the feature vector of the input document to each of a plurality of signature vectors to determine a primary type of the input document, wherein comparing the feature vector of the input document to each of the plurality of signature vectors to determine the primary type of the input document includes identifying a signature vector that maximizes the expression V·(Ct/D), where V is the feature vector, Ct is a signature vector t in the plurality of signature vectors, and D is the at least one cross-type frequency vector; and
storing, by the at least one server, the primary type of the input document into a storage system in communication with the at least one server.
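The comparison step can be illustrated with a minimal sketch. The claim states only that the chosen signature vector maximizes V·(Ct/D); the sketch assumes the division Ct/D is taken element-wise, that vectors are sparse mappings from text sequence to weight, and that features with a zero or missing entry in D are skipped. The names `score` and `primary_type` and all sample values are hypothetical, and the sample D is simply the per-feature sum across the two types, which is an assumption rather than something the claim specifies.

```python
from typing import Mapping

Vector = Mapping[str, float]


def score(v: Vector, c_t: Vector, d: Vector) -> float:
    """Compute V . (Ct / D), assuming element-wise division (not stated in the claim)."""
    total = 0.0
    for feature, weight in v.items():
        denom = d.get(feature, 0.0)
        if denom > 0:  # assumption: skip features absent from D
            total += weight * c_t.get(feature, 0.0) / denom
    return total


def primary_type(v: Vector, signatures: Mapping[str, Vector], d: Vector) -> str:
    """Return the type t whose signature vector Ct maximizes V . (Ct / D)."""
    return max(signatures, key=lambda t: score(v, signatures[t], d))


# Hypothetical sample data, for illustration only.
v = {"invoice number": 2, "total due": 1, "dear sir": 1}
signatures = {
    "invoice": {"invoice number": 8, "total due": 6, "dear sir": 1},
    "letter": {"dear sir": 9, "total due": 1},
}
d = {"invoice number": 8, "total due": 7, "dear sir": 10}

best = primary_type(v, signatures, d)
```

Dividing by D reduces the influence of text sequences that are common across every type, so sequences concentrated in one type dominate the score.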
3 Assignments
0 Petitions
Abstract
A system and method for document classification are presented. An input document is received (e.g., by at least one server communicatively coupled to a network). A plurality of features are identified in the input document. The plurality of features include sequences of text extracted from the input document. A feature vector of the input document is generated based upon the sequences of text, and the feature vector of the input document is compared to each of a plurality of signature vectors to determine a primary type of the input document. The primary type of the input document is stored into a storage system in communication with the at least one server.
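As a rough sketch of the feature-vector step, the following assumes that the "sequences of text" are lowercase word bigrams and that the feature vector is a count of each bigram. The abstract does not specify either choice, and the function names `text_sequences` and `feature_vector` are hypothetical.

```python
import re
from collections import Counter
from typing import List


def text_sequences(text: str, n: int = 2) -> List[str]:
    """Split text into lowercase word n-grams (assumed form of the text sequences)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def feature_vector(text: str, n: int = 2) -> Counter:
    """Count each n-gram; the counts serve as a sparse feature vector."""
    return Counter(text_sequences(text, n))


# Hypothetical input document text.
fv = feature_vector("Total due: 40. Total due by Friday.")
```

Here `fv` maps each bigram found in the text to its count, so the repeated phrase "total due" receives a larger weight than bigrams that appear once.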
79 Citations
14 Claims
1. A method, comprising:
receiving, by at least one server communicatively coupled to a network, an input document;
identifying, by the at least one server, a plurality of features in the input document, the plurality of features including sequences of text extracted from the input document;
generating, by the at least one server, a feature vector of the input document based upon the sequences of text;
identifying, by the at least one server, a plurality of signature vectors based upon an input training dataset and at least one cross-type frequency vector;
comparing, by the at least one server, the feature vector of the input document to each of a plurality of signature vectors to determine a primary type of the input document, wherein comparing the feature vector of the input document to each of the plurality of signature vectors to determine the primary type of the input document includes identifying a signature vector that maximizes the expression V·(Ct/D), where V is the feature vector, Ct is a signature vector t in the plurality of signature vectors, and D is the at least one cross-type frequency vector; and
storing, by the at least one server, the primary type of the input document into a storage system in communication with the at least one server.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system, comprising:
a server computer configured to communicate with a content source using a network, the server computer being configured to:
receive an input document,
identify a plurality of features in the input document, the plurality of features including sequences of text extracted from the input document,
generate a feature vector of the input document based upon the sequences of text,
identify a plurality of signature vectors based upon an input training dataset and at least one cross-type frequency vector,
compare the feature vector of the input document to each of a plurality of signature vectors to determine a primary type of the input document wherein comparing the feature vector of the input document to each of the plurality of signature vectors to determine the primary type of the input document includes identifying a signature vector that maximizes the expression V·(Ct/D), where V is the feature vector, Ct is a signature vector t in the plurality of signature vectors, and D is the at least one cross-type frequency vector, and
store the primary type of the input document into a storage system.
Dependent claims: 9, 10, 11, 12, 13, 14
Specification