Automatic document separation
First Claim
1. A method for automatically separating documents represented within a plurality of images by delineating document boundaries and identifying document types in accordance with classification rules, the method comprising:
automatically generating classification rules that predict a document type or subdocument type for each of the plurality of images based on textual information and/or graphical information represented in each respective one of the plurality of images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
a probabilistic network;
relational algebra; and
machine learning techniques;
automatically generating one or more identifiers for identifying which of a plurality of document images belongs to which of a plurality of categories;
automatically categorizing a plurality of document images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein the step of automatically categorizing comprises:
producing an output score for each document image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for the plurality of document images based on the output scores; and
separating documents within the plurality of document images from one another by either:
electronically associating at least one computer-generated label with at least some of the plurality of document images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
or inserting one or more computer-generated separation pages between at least some of the plurality of document images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
or both electronically associating the at least one computer-generated label with at least some of the plurality of document images and inserting the one or more computer-generated separation pages between at least some of the plurality of document images.
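The categorizing step recites a graph search over per-image output scores but does not name a particular algorithm. As an illustration only, the sketch below assumes a Viterbi-style dynamic program over a layered graph in which each node is a (page, category) pair. The function name best_sequence, the transition table, and the floor parameter are hypothetical and do not come from the claim.

```python
import math


def best_sequence(page_scores, transition, floor=1e-9):
    """Return the highest-scoring category sequence for a run of pages.

    page_scores: list of dicts, one per page, mapping category -> probability
                 (stands in for the claimed output scores).
    transition:  dict mapping (previous_category, category) -> probability.
                 Hypothetical: the claim does not define transition weights.
    floor:       small value that replaces missing or zero probabilities.
    """
    if not page_scores:
        return []

    def log(p):
        return math.log(max(p, floor))

    categories = list(page_scores[0])
    # best[c] = (log score of the best path ending in category c, that path)
    best = {c: (log(page_scores[0][c]), [c]) for c in categories}

    for scores in page_scores[1:]:
        nxt = {}
        for c in categories:
            # Pick the predecessor that maximizes path score plus transition.
            prev = max(
                categories,
                key=lambda p: best[p][0] + log(transition.get((p, c), floor)),
            )
            prev_score, prev_path = best[prev]
            step = log(transition.get((prev, c), floor))
            nxt[c] = (prev_score + step + log(scores[c]), prev_path + [c])
        best = nxt

    return max(best.values(), key=lambda entry: entry[0])[1]
```

With a transition table that favors staying in the same category, a page whose own score narrowly prefers a different category can still be assigned to the surrounding document, which is the kind of sequence-level decision a per-page maximum cannot make.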
Abstract
A method and system for delineating document and/or subdocument boundaries and identifying document and/or subdocument types, the method comprising: automatically generating at least one identifier for identifying which of a plurality of document and/or subdocument images belongs to which of a plurality of categories. The method and/or system optionally may include automatically categorizing a plurality of document and/or subdocument images into a plurality of predetermined categories in accordance with classification rules for said categories.
21 Claims
1. A method for automatically separating documents represented within a plurality of images by delineating document boundaries and identifying document types in accordance with classification rules, the method comprising:
automatically generating classification rules that predict a document type or subdocument type for each of the plurality of images based on textual information and/or graphical information represented in each respective one of the plurality of images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
a probabilistic network;
relational algebra; and
machine learning techniques;
automatically generating one or more identifiers for identifying which of a plurality of document images belongs to which of a plurality of categories;
automatically categorizing a plurality of document images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein the step of automatically categorizing comprises:
producing an output score for each document image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for the plurality of document images based on the output scores; and
separating documents within the plurality of document images from one another by either:
electronically associating at least one computer-generated label with at least some of the plurality of document images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
or inserting one or more computer-generated separation pages between at least some of the plurality of document images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
or both electronically associating the at least one computer-generated label with at least some of the plurality of document images and inserting the one or more computer-generated separation pages between at least some of the plurality of document images.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. In a computer-based system, a method for automatically separating documents represented within a plurality of images by delineating document boundaries and identifying document types in accordance with classification rules, the method comprising:
automatically generating classification rules that predict a document type or subdocument type for each of the plurality of images based on textual information and/or graphical information represented in each respective one of the plurality of images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
a probabilistic network;
relational algebra; and
machine learning techniques;
obtaining the plurality of images;
automatically categorizing a plurality of subdocument images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein said step of automatically categorizing comprises:
producing an output score for each subdocument image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for said plurality of subdocument images based on said output scores;
automatically generating at least one identifier for identifying which of said plurality of subdocument images belongs to which of said plurality of predetermined categories; and
separating subdocuments within the plurality of subdocument images from one another by either:
electronically associating at least one computer-generated label with at least some of the plurality of subdocument images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
or inserting one or more computer-generated separation pages between at least some of the plurality of subdocument images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
or both electronically associating the at least one computer-generated label with at least some of the plurality of subdocument images and inserting the one or more computer-generated separation pages between at least some of the plurality of subdocument images.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. A computer program product for automatically separating documents from subdocuments represented within a plurality of document and subdocument images by delineating document and subdocument boundaries and identifying document types in accordance with classification rules, the computer program product comprising a non-transitory computer readable storage medium having embodied thereon computer readable program code executable by a processor to cause the processor to:
separate subdocument images from document images, wherein the subdocument images and the document images are part of a single collection;
automatically generate classification rules that predict a document type or subdocument type for each of the subdocument images and the document images based on textual information and/or graphical information represented in each respective one of the subdocument images and the document images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
a probabilistic network;
relational algebra; and
machine learning techniques;
automatically categorize a plurality of the subdocument images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein the step of automatically categorizing comprises:
producing an output score for each subdocument image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for the plurality of subdocument images based on the output scores; and
generate at least one identifier for identifying which of said plurality of subdocument images belongs to which of said plurality of predetermined categories; and
separating subdocuments within the plurality of subdocument images from one another by either:
electronically associating at least one computer-generated label with at least some of the plurality of subdocument images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
or inserting one or more computer-generated separation pages between at least some of the plurality of subdocument images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
or both electronically associating the at least one computer-generated label with at least some of the plurality of subdocument images and inserting the one or more computer-generated separation pages between at least some of the plurality of subdocument images.
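The separating step common to the independent claims offers three alternatives: labels, separation pages, or both. A minimal sketch of how those alternatives might be expressed follows. It assumes each image already has a category identifier; the function name separate, the mode values, and the tuple layout of the output stream are hypothetical and are not taken from the claims. This simplified version places a separator only where the category changes, so it would not split two adjacent documents of the same category.

```python
def separate(pages, categories, mode="both"):
    """Attach category labels and/or insert separator entries between pages.

    pages:      list of page identifiers, in scan order.
    categories: list of category identifiers, one per page (same length).
    mode:       "label", "separator", or "both" (hypothetical option names).

    Returns (labels, stream): labels maps page -> category identifier, and
    stream is the page order with ("separator", category) entries inserted
    between pages whose categories differ.
    """
    if len(pages) != len(categories):
        raise ValueError("pages and categories must have the same length")
    if mode not in ("label", "separator", "both"):
        raise ValueError("unknown mode: " + mode)

    labels = {}
    stream = []
    previous = None
    for page, category in zip(pages, categories):
        boundary = previous is not None and category != previous
        if boundary and mode in ("separator", "both"):
            # The separator carries the identifier of the category that follows.
            stream.append(("separator", category))
        stream.append(("page", page))
        if mode in ("label", "both"):
            labels[page] = category
        previous = category
    return labels, stream
```

Calling separate(["p1", "p2", "p3"], ["invoice", "invoice", "letter"]) labels all three pages and inserts one separator entry before "p3".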
Specification