Methods and apparatuses for classifying electronic documents
Abstract
Embodiments of the invention provide methods and apparatuses for classifying electronic documents (e.g., electronic communications) as either spam electronic documents or legitimate electronic documents. In accordance with one embodiment of the invention, each of a plurality of electronic communications is reduced to a corresponding multi-dimensional vector based on a multi-dimensional vector space. The multi-dimensional vectors represent corresponding electronic documents that have been classified as at least one type of electronic document. Each subsequent electronic document to be classified is reduced to a corresponding multi-dimensional vector, which is inserted into the multi-dimensional vector space. The electronic document corresponding to an inserted multi-dimensional vector is classified based upon the proximity of the inserted multi-dimensional vector to at least one previously classified multi-dimensional vector of the multi-dimensional vector space.
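The proximity-based classification described in the abstract can be illustrated with a minimal sketch. The abstract does not state which distance measure is used or how many neighbouring vectors are consulted, so the cosine similarity, the single-nearest-neighbour rule, and the names `cosine_similarity` and `classify_by_proximity` are all assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed proximity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar to everything
    return dot / (norm_a * norm_b)

def classify_by_proximity(vector, labeled_vectors):
    """Return the label of the previously classified vector closest to `vector`."""
    best_label, best_score = None, -1.0
    for known_vector, label in labeled_vectors:
        score = cosine_similarity(vector, known_vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, hand-made vectors standing in for previously classified documents.
labeled = [([3, 0, 1], "spam"), ([0, 2, 0], "legitimate")]
result = classify_by_proximity([2, 0, 0], labeled)
```

A single nearest neighbour is the simplest reading of "proximity ... to at least one previously classified" vector; a vote among several neighbours would fit the same wording.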
18 Claims
1. A method for processing a training set of electronic documents for document processing, the method implemented by computer instructions executing on a computer processor, said method comprising:
receiving said training set of electronic documents that are each assigned to two or more categories;
determining a first set of frequencies with which a set of document features appear in the training set of electronic documents;
determining a second set of frequencies with which the set of document features appear in each of the two or more categories of the training set of electronic documents;
selecting a subset of said set of document features for defining a multi-dimensional vector space for processing documents, said subset of document features selected from said set of document features based upon said first set of frequencies and said second set of frequencies; and
reducing each electronic document of the training set of electronic documents to a multi-dimensional vector in the multi-dimensional vector space.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer-readable medium, said computer-readable medium comprising a set of computer instructions that, when executed, implement a method for processing a training set of electronic documents for document processing, said method comprising:
receiving said training set of electronic documents that are each assigned to two or more categories;
determining a first set of frequencies with which a set of document features appear in the training set of electronic documents;
determining a second set of frequencies with which the set of document features appear in each of the two or more categories of the training set of electronic documents;
selecting a subset of said set of document features for defining a multi-dimensional vector space for processing documents, said subset of document features selected from said set of document features based upon said first set of frequencies and said second set of frequencies; and
reducing each electronic document of the training set of electronic documents to a multi-dimensional vector in the multi-dimensional vector space.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
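The training steps recited in the independent claims (overall feature frequencies, per-category feature frequencies, feature selection, reduction to vectors) can be sketched as follows. The claims do not say what counts as a document feature or how the two sets of frequencies are combined, so everything specific here is an assumption: features are lowercase whitespace-separated tokens, frequencies are document counts, each training document carries a single category label, and a feature is scored by the spread of its per-category rates after passing a minimum overall count. The names `tokenize`, `select_features`, `to_vector`, `min_total`, and `top_k` are hypothetical.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Assumed feature extractor: lowercase whitespace-separated tokens."""
    return text.lower().split()

def select_features(training_set, min_total=2, top_k=4):
    total = Counter()                    # first set: frequencies over the whole training set
    per_category = defaultdict(Counter)  # second set: frequencies within each category
    docs_in_category = Counter()
    for text, category in training_set:
        features = set(tokenize(text))   # count each feature once per document
        total.update(features)
        per_category[category].update(features)
        docs_in_category[category] += 1

    scores = {}
    for feature, count in total.items():
        if count < min_total:            # use the overall frequency to drop rare features
            continue
        rates = [per_category[c][feature] / docs_in_category[c]
                 for c in docs_in_category]
        scores[feature] = max(rates) - min(rates)  # spread across categories
    # Highest spread first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:top_k]

def to_vector(text, selected):
    """Reduce a document to one coordinate per selected feature (token counts)."""
    counts = Counter(tokenize(text))
    return [counts[f] for f in selected]

# Hypothetical toy training set, invented for illustration.
training_set = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting notes for monday", "legitimate"),
    ("notes for the monday meeting now", "legitimate"),
]
selected = select_features(training_set)
vectors = [to_vector(text, selected) for text, _ in training_set]
```

In this toy data the token "now" appears in both categories, so its per-category spread is lower and it ranks below tokens that appear in only one category. Other selection criteria that combine the two sets of frequencies would fit the claim language equally well.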
Specification