Automatic document classification using lexical and physical features
Abstract
An automatic document classification system is described that uses lexical and physical features to assign a class ci ∈ C = {c1, c2, ..., ci} to a document d. The primary lexical features are the result of a feature selection method known as Orthogonal Centroid Feature Selection (OCFS). Additional information may be gathered on character type frequencies (digits, letters, and symbols) within d. Physical information is assembled through image analysis to yield physical attributes such as document dimensionality, text alignment, and color distribution. The resulting lexical and physical information is combined into an input vector X and is used to train a supervised neural network to perform the classification.
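The feature pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the commonly published OCFS criterion (each term is scored by the class-weighted squared distance between the class centroid and the global centroid), ignores whitespace when counting character types, and treats the physical attributes as an already-extracted list of numbers. All function names, the toy term counts, and the physical values (width, height, colored-pixel fraction) are hypothetical. Training the neural network on X is left out.

```python
from collections import Counter


def char_type_frequencies(text):
    """Return relative frequencies of [digits, letters, symbols] in text."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue  # assumption: whitespace is not counted as a symbol
        if ch.isdigit():
            counts["digits"] += 1
        elif ch.isalpha():
            counts["letters"] += 1
        else:
            counts["symbols"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts[k] / total for k in ("digits", "letters", "symbols")]


def ocfs_scores(term_vectors, labels):
    """Score each term by the class-weighted squared distance between
    class centroids and the global centroid (assumed OCFS criterion)."""
    n = len(term_vectors)
    dims = len(term_vectors[0])
    global_centroid = [sum(v[j] for v in term_vectors) / n for j in range(dims)]
    scores = [0.0] * dims
    for cls in set(labels):
        members = [v for v, y in zip(term_vectors, labels) if y == cls]
        weight = len(members) / n
        for j in range(dims):
            centroid_j = sum(v[j] for v in members) / len(members)
            scores[j] += weight * (centroid_j - global_centroid[j]) ** 2
    return scores


def select_top_terms(scores, k):
    """Return the indices of the k highest-scoring terms, in index order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def build_input_vector(term_vector, selected, text, physical):
    """Concatenate selected lexical features, character-type frequencies,
    and physical attributes into one input vector X."""
    lexical = [term_vector[j] for j in selected]
    return lexical + char_type_frequencies(text) + list(physical)


# Toy term-count vectors for four documents in two hypothetical classes.
docs = [[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 5, 1]]
labels = ["invoice", "invoice", "letter", "letter"]
scores = ocfs_scores(docs, labels)
selected = select_top_terms(scores, k=2)

# Hypothetical physical attributes: width (in), height (in), colored-pixel fraction.
x = build_input_vector(docs[0], selected, "Invoice 42!", physical=[8.5, 11.0, 0.02])
```

In this toy data the third term appears equally often in every document, so its score is zero and it is dropped, which is the kind of non-discriminating feature a centroid-based selection is meant to filter out.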
48 Claims
1. A method for classifying a scanned document, comprising:
extracting non-graphical physical attributes of the scanned document using a processor programmed to extract said physical attributes, said non-graphical physical attributes excluding graphical attributes relating to the way content is organized and displayed within the scanned document; and
classifying the scanned document based on the extracted physical attributes of the scanned document.
Dependent claims: 2-16
17. A non-transitory computer readable storage medium having instructions stored thereon that when executed by a processor causes said processor to implement a method for classifying a scanned document, said method comprising:
extracting non-graphical physical attributes of the scanned document, said non-graphical physical attributes excluding graphical attributes relating to the way content is organized and displayed within the scanned document; and
classifying the scanned document based on the extracted physical attributes of the scanned document.
Dependent claims: 18-32
33. A system for classifying a document, comprising:
a scanner that scans and digitizes images on the document;
a processing system adapted to extract non-graphical physical attributes of the scanned document, said non-graphical physical attributes excluding graphical attributes relating to the way content is organized and displayed within the scanned document, and to classify the scanned document based on the extracted physical attributes of the scanned document.
Dependent claims: 34-48
Specification