Method and system for commercial document image classification
First Claim
1. A method of automatic commercial document image classification using a computer for performing the steps of:
automatically obtaining the document layout using the salient features of the document images layout, said features consisting of text blocks, geometric line segments and the contents (sets of words) of said text blocks;
defining distances between such layouts;
automatically creating a plurality of classification template layouts directly from said distances;
ordering said plurality of classification template layouts and generating a ranked list of candidate template layouts best matching a given input document layout in such a way that the classification template layout most similar to the document layout to be classified is at the top of the list (highest ranked layout), the next likely candidate template layout is in the second position and so forth;
classifying the input image as belonging to the class of the top template layout in said ranked list of template layouts.
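The ranking and classification steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the `Layout` class, its `label` and `words` fields, and the Jaccard word-set distance are all assumptions chosen for brevity, and the sketch ignores the text-block geometry and line segments that the claim also lists as layout features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical, reduced layout: only the word contents of the text blocks."""
    label: str         # class name of a template (unused for input documents)
    words: frozenset   # set of words found in the text blocks


def word_distance(a: Layout, b: Layout) -> float:
    """Jaccard distance between word sets (an assumed stand-in metric)."""
    union = a.words | b.words
    if not union:
        return 0.0
    return 1.0 - len(a.words & b.words) / len(union)


def rank_templates(doc, templates, distance=word_distance):
    """Return the templates ordered from most to least similar to doc."""
    return sorted(templates, key=lambda t: distance(doc, t))


def classify(doc, templates, distance=word_distance):
    """Assign doc the class of the top-ranked template layout."""
    return rank_templates(doc, templates, distance)[0].label


templates = [
    Layout("bill_of_lading", frozenset({"bill", "lading", "carrier"})),
    Layout("invoice", frozenset({"invoice", "total", "due"})),
]
doc = Layout("", frozenset({"invoice", "total", "amount"}))
ranked = [t.label for t in rank_templates(doc, templates)]
predicted = classify(doc, templates)
```

Any other distance between layouts can be passed through the `distance` parameter; the ranking and top-of-list classification stay the same.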
Abstract
A completely automated method for classifying electronic images of commercial documents such as invoices, bills of lading, explanations of benefits, etc. is based on the layout of the documents. A document is classified as belonging to a class of similarly laid out images if its distance, determined by a set of pre-defined metrics, to any of the template layouts from the same class of template layouts does not exceed a certain user-defined threshold. Alternatively, if the sum of distances from a given image to several template layouts from the same class does not exceed a user-selected threshold, the image is deemed to belong to a class of images with a specific layout.
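The two membership rules in the abstract can be sketched as follows. This is an illustration under stated assumptions: the distances from the image to the templates of one class are taken as already computed, and the function names are hypothetical.

```python
def belongs_by_any(distances, threshold):
    """True if the distance to at least one template of the class
    does not exceed the user-defined threshold."""
    return any(d <= threshold for d in distances)


def belongs_by_sum(distances, threshold):
    """True if the summed distance to the chosen templates of the class
    does not exceed the user-selected threshold."""
    if not distances:  # no templates to compare against
        return False
    return sum(distances) <= threshold


# Distances from one image to the templates of one class (made-up values).
close_to_one = [0.4, 0.9]
close_to_none = [0.6, 0.9]
```

With a threshold of 0.5, `belongs_by_any(close_to_one, 0.5)` accepts the image, while `belongs_by_any(close_to_none, 0.5)` rejects it.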
12 Claims
Specification