Automatic document separation

US 9,910,829 B2
Filed: 02/14/2014
Issued: 03/06/2018
Est. Priority Date: 12/19/2003
Status: Active Grant

Abstract

A method and system for delineating document and/or subdocument boundaries and identifying document and/or subdocument types, the method comprising: automatically generating at least one identifier for identifying which of a plurality of document and/or subdocument images belongs to which of a plurality of categories. The method and/or system optionally may include automatically categorizing a plurality of document and/or subdocument images into a plurality of predetermined categories in accordance with classification rules for said categories.

21 Claims

1. A method for automatically separating documents represented within a plurality of images by delineating document boundaries and identifying document types in accordance with classification rules, the method comprising:
- automatically generating classification rules that predict a document type or subdocument type for each of the plurality of images based on textual information and/or graphical information represented in each respective one of the plurality of images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
  
  a probabilistic network;
  
  relational algebra; and
  
  machine learning techniques;
  
  automatically generating one or more identifiers for identifying which of a plurality of document images belongs to which of a plurality of categories;
  
  automatically categorizing a plurality of document images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein the step of automatically categorizing comprises:
  
  producing an output score for each document image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
  
  using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for the plurality of document images based on the output scores; and
  
  separating documents within the plurality of document images from one another by either:
  
  electronically associating at least one computer-generated label with at least some of the plurality of document images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
  
  or inserting one or more computer-generated separation pages between at least some of the plurality of document images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of document images belongs to which of the plurality of categories;
  
  or both electronically associating the at least one computer-generated label with at least some of the plurality of document images and inserting the one or more computer-generated separation pages between at least some of the plurality of document images.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein the one or more identifiers further comprise the one or more computer-generated separation pages.
  - 3. The method of claim 1, wherein at least some of the document images independently consist of a subdocument selected from a plurality of subdocuments;
    - and wherein the plurality of subdocuments comprise a plurality of forms located within a single document.
  - 4. The method of claim 1, wherein the one or more identifiers further comprise a computer-readable description that identifies a categorization sequence for the plurality of document images in accordance with their categorization;
    - and wherein the computer-readable description comprises an XML message.
  - 5. The method of claim 1, wherein the one or more identifiers comprise the at least one computer-generated label.
  - 6. The method of claim 1, wherein the plurality of categories comprises at least two different form types used in a financial transaction;
    - and wherein the plurality of categories further comprise first, middle and end page categories for each of the at least two different form types.
  - 7. The method of claim 1, wherein automatically generating the one or more identifiers is based at least in part on both of:
    - graphical information corresponding to the plurality of document images; and
      
      textual information corresponding to the plurality of document images.
  - 8. The method of claim 1, wherein the output scores represent a probability that each document image belongs to at least one respective category from the plurality of categories.
  - 9. The method of claim 1, wherein the step of using a graph search algorithm comprises:
    - using a graph structure to calculate a total output score, based on the output scores for each of the plurality of document images, for each possible categorization sequence; and
      
      determining which categorization sequence yields the highest total output score.
  - 10. The method of claim 1, wherein at least some of the document images independently consist of a subdocument selected from a plurality of subdocuments;
    - wherein at least some of the plurality of subdocuments comprise an entire page of a document; and
      
      wherein at least some other of the plurality of subdocuments comprise a form of the document.
  - 11. The method of claim 1, wherein automatically generating the one or more identifiers is based at least in part on information selected from a group consisting of:
    - subdocument sequence information;
      
      a subdocument frequency; and
      
      a subdocument length distribution.
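
Claims 1 and 9 describe choosing the categorization sequence with the highest total output score by means of a graph search, without naming a specific algorithm. One way to picture this is a Viterbi-style dynamic program over a trellis of page categories. The sketch below is an illustration only, not the patented implementation: the form types `A` and `B`, the transition rule in `allowed`, and the choice to total the scores by summing log probabilities are all assumptions made for the example.

```python
import math

# Hypothetical page categories: (form type, position) pairs, loosely following
# the first/middle/end page categories of claim 6.
FORMS = ["A", "B"]
POSITIONS = ["first", "middle", "end"]
CATEGORIES = [(f, p) for f in FORMS for p in POSITIONS]


def allowed(prev, cur):
    """Assumed transition rule: a form runs first -> middle* -> end,
    and a new form may only start after the previous one has ended."""
    (prev_form, prev_pos), (cur_form, cur_pos) = prev, cur
    if prev_pos in ("first", "middle"):
        return cur_form == prev_form and cur_pos in ("middle", "end")
    return cur_pos == "first"  # after an end page, any form may start


def best_sequence(scores):
    """scores: one dict per page mapping category -> probability.
    Returns (total log score, category path) for the best valid sequence,
    or None if no valid sequence exists."""
    # best[c] = (total log score, path) of the best path ending in category c.
    best = {c: (math.log(scores[0][c]), [c])
            for c in CATEGORIES
            if c[1] == "first" and scores[0].get(c, 0) > 0}
    for page in scores[1:]:
        nxt = {}
        for cur in CATEGORIES:
            p = page.get(cur, 0)
            if p <= 0:
                continue
            candidates = [(s + math.log(p), path + [cur])
                          for prev, (s, path) in best.items()
                          if allowed(prev, cur)]
            if candidates:
                nxt[cur] = max(candidates, key=lambda t: t[0])
        best = nxt
    # A valid batch must finish on an end page.
    finals = [v for c, v in best.items() if c[1] == "end"]
    return max(finals, key=lambda t: t[0]) if finals else None


# Made-up output scores for a four-page batch.
page_scores = [
    {("A", "first"): 0.9, ("B", "first"): 0.1},
    {("A", "end"): 0.6, ("A", "middle"): 0.3, ("B", "first"): 0.1},
    {("A", "end"): 0.55, ("B", "first"): 0.45},
    {("B", "end"): 0.9, ("A", "end"): 0.1},
]
total, path = best_sequence(page_scores)
print(path)
```

Taken page by page, the third page scores highest as the end of form A, but that label cannot follow an end page under the assumed rule, so the search settles on the start of form B instead. Under this rule a form needs at least two pages; a real system would need a category for single-page forms.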

12. In a computer-based system, a method for automatically separating documents represented within a plurality of images by delineating document boundaries and identifying document types in accordance with classification rules, the method comprising:
- automatically generating classification rules that predict a document type or subdocument type for each of the plurality of images based on textual information and/or graphical information represented in each respective one of the plurality of images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
  
  a probabilistic network;
  
  relational algebra; and
  
  machine learning techniques;
  
  obtaining the plurality of images;
  
  automatically categorizing a plurality of subdocument images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein said step of automatically categorizing comprises:
  
  producing an output score for each subdocument image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
  
  using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for said plurality of subdocument images based on said output scores;
  
  automatically generating at least one identifier for identifying which of said plurality of subdocument images belongs to which of said plurality of predetermined categories; and
  
  separating subdocuments within the plurality of subdocument images from one another by either:
  
  electronically associating at least one computer-generated label with at least some of the plurality of subdocument images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
  
  or inserting one or more computer-generated separation pages between at least some of the plurality of subdocument images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
  
  or both electronically associating the at least one computer-generated label with at least some of the plurality of subdocument images and inserting the one or more computer-generated separation pages between at least some of the plurality of subdocument images.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20):
  - 13. The method of claim 12, wherein said one or more identifiers comprise the one or more computer-generated separation pages.
  - 14. The method of claim 12, wherein said one or more identifiers further comprise a computer-readable description that identifies a categorization sequence for said plurality of document images in accordance with their categorization;
    - and wherein said computer-readable description comprises an XML message.
  - 15. The method of claim 12, wherein at least some of the plurality of document images independently consist of a subdocument selected from a plurality of subdocuments;
    - and wherein the plurality of subdocuments are located within a single document.
  - 16. The method of claim 12, wherein said one or more identifiers comprise the at least one computer-generated label.
  - 17. The method of claim 12, wherein said plurality of predetermined categories comprises at least a portion of two different forms used in a financial transaction;
    - and wherein said plurality of predetermined categories further comprise at least two different form types.
  - 18. The method of claim 12, wherein at least some of the plurality of document images independently consist of a subdocument selected from a plurality of subdocuments, and the method comprising separating the subdocuments from a plurality of documents.
  - 19. The method of claim 12, wherein said output scores represent a probability that each subdocument image belongs to at least one respective category from said plurality of predetermined categories.
  - 20. The method of claim 12, wherein said step of using a graph search algorithm comprises:
    - using a graph structure to calculate a total output score, based on said output scores for each of said plurality of subdocument images, for each said possible categorization sequence; and
      
      determining which categorization sequence yields the highest total output score.
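
The separation step of claim 12 (labels, separation pages, or both) and the XML description of claims 4 and 14 can be pictured with a small sketch. It assumes a categorization sequence has already been chosen and is given as `(form type, position)` pairs. The separator marker string and the XML element names `batch`, `document` and `page` are hypothetical; the claims do not specify a schema.

```python
import xml.etree.ElementTree as ET


def separate(pages, categories):
    """pages: list of page image names; categories: parallel list of
    (form type, position) tuples.
    Returns (labels, stream), where labels pairs each page with its form type
    and stream is the page list with separator markers inserted between
    consecutive documents."""
    labels, stream = [], []
    for page, (form, position) in zip(pages, categories):
        # Insert a separator only between documents, not before the first one.
        if position == "first" and stream:
            stream.append(f"<<SEPARATOR:{form}>>")
        labels.append((page, form))
        stream.append(page)
    return labels, stream


def describe(pages, categories):
    """Build an XML message (hypothetical schema) describing the sequence."""
    root = ET.Element("batch")
    doc = None
    for page, (form, position) in zip(pages, categories):
        # Open a new document element at each "first" page.
        if position == "first" or doc is None:
            doc = ET.SubElement(root, "document", type=form)
        ET.SubElement(doc, "page", image=page, position=position)
    return ET.tostring(root, encoding="unicode")


pages = ["p1.tif", "p2.tif", "p3.tif", "p4.tif"]
categories = [("A", "first"), ("A", "end"), ("B", "first"), ("B", "end")]
labels, stream = separate(pages, categories)
print(stream)
print(describe(pages, categories))
```

Keeping the labels and the separator stream as two return values mirrors the claim's three options: a caller can use either one alone or both together.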

21. A computer program product for automatically separating documents from subdocuments represented within a plurality of document and subdocument images by delineating document and subdocument boundaries and identifying document types in accordance with classification rules, the computer program product comprising a non-transitory computer readable storage medium having embodied thereon computer readable program code executable by a processor to cause the processor to:
- separate subdocument images from document images, wherein the subdocument images and the document images are part of a single collection;
  
  automatically generate classification rules that predict a document type or subdocument type for each of the subdocument images and the document images based on textual information and/or graphical information represented in each respective one of the subdocument images and the document images, wherein the classification rules are generated based on analyzing textual information and/or graphical information of a plurality of training images using one or more of:
  
  a probabilistic network;
  
  relational algebra; and
  
  machine learning techniques;
  
  automatically categorize a plurality of the subdocument images into a plurality of predetermined categories based on analyzing textual information and/or image characteristics of each of the plurality of document images using the classification rules, wherein the step of automatically categorizing comprises:
  
  producing an output score for each subdocument image based on the analysis thereof using the classification rules, wherein each output score represents an estimated document type probability or a subdocument type probability; and
  
  using a graph search algorithm to determine an optimum categorization sequence from a plurality of possible categorization sequences for the plurality of subdocument images based on the output scores; and
  
  generate at least one identifier for identifying which of said plurality of subdocument images belongs to which of said plurality of predetermined categories; and
  
  separating subdocuments within the plurality of subdocument images from one another by either:
  
  electronically associating at least one computer-generated label with at least some of the plurality of subdocument images, each label corresponding to a different one of the plurality of categories and comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
  
  or inserting one or more computer-generated separation pages between at least some of the plurality of subdocument images to delineate images belonging to different ones of the plurality of categories, each separation page comprising one of the one or more identifiers generated for identifying which of the plurality of subdocument images belongs to which of the plurality of predetermined categories;
  
  or both electronically associating the at least one computer-generated label with at least some of the plurality of subdocument images and inserting the one or more computer-generated separation pages between at least some of the plurality of subdocument images.
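
The independent claims recite generating classification rules from training images using "machine learning techniques" and producing an output score per page that represents an estimated type probability, but they do not name a particular learner. As a rough illustration, the sketch below substitutes a small multinomial naive Bayes model over OCR words. The category names, the training words, and the function names `train` and `score` are invented for the example and are not taken from the patent.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (category, list of OCR words) from training pages."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for cat, words in examples:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return {"words": word_counts, "cats": cat_counts, "vocab": vocab}


def score(model, words):
    """Return category -> estimated probability for one page's words."""
    total_pages = sum(model["cats"].values())
    vocab_size = len(model["vocab"])
    log_scores = {}
    for cat, n in model["cats"].items():
        counts = model["words"][cat]
        denom = sum(counts.values()) + vocab_size
        lp = math.log(n / total_pages)  # category prior
        for w in words:
            lp += math.log((counts[w] + 1) / denom)  # Laplace smoothing
        log_scores[cat] = lp
    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exps = {c: math.exp(lp - top) for c, lp in log_scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


# Made-up training data: one page per category.
model = train([
    ("invoice", ["invoice", "total", "due"]),
    ("appraisal", ["property", "value", "appraisal"]),
])
print(score(model, ["total", "due"]))
```

Per-page probabilities of this kind are the sort of output score a sequence search could then combine across the batch.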

Current Assignee: Kofax Incorporated
Original Assignee: Kofax Incorporated
Inventors: Schmidtler, Mauritius A. R.; Texeira, Scott S.; Harris, Christopher K.; Samat, Sameer; Borrey, Roland G.; Macciola, Anthony
Primary Examiner(s): Demeter, Hilina K
Application Number: US14/181,497
Publication Number: US 20140164914A1
Time in Patent Office: 1,481 Days
Field of Search: None

CPC Class Codes

G06F 40/10   Text processing natural lan...
G06V 30/40   Document-oriented image-bas...
H04N 1/32112   in a separate computer file...
H04N 2201/3225   of data relating to an imag...
H04N 2201/3243   of type information, e.g. h...
