Clustering of forms from large-scale scanned-document collection
Abstract
Techniques for identifying documents sharing common underlying structures in a large collection of documents and processing the documents using the identified structures are disclosed. Images of the document collection are processed to detect occurrences of a predetermined set of image features that are common or similar among forms. The images are then indexed in an image index based on the detected image features. A graph of nodes is built. Nodes in the graph represent images and are connected to nodes representing similar document images by edges. Documents sharing common underlying structures are identified by gathering strongly inter-connected nodes in the graph. The identified documents are processed based at least in part on the resulting clusters.
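The abstract does not say how "strongly inter-connected nodes" are gathered. As a minimal sketch, assume a stand-in technique: drop edges below a weight threshold and take the connected components of what remains, using union-find. The function name `cluster_nodes`, the `min_weight` parameter, and the toy graph are hypothetical and not taken from the patent.

```python
def cluster_nodes(graph, min_weight=2):
    """Group nodes joined by edges of weight >= min_weight (assumed criterion)."""
    parent = {node: node for node in graph}

    def find(node):
        # Walk to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for node, neighbors in graph.items():
        for other, weight in neighbors.items():
            if weight >= min_weight:
                parent[find(node)] = find(other)

    clusters = {}
    for node in graph:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Toy similarity graph: node -> {neighbor: number of shared features}.
graph = {
    "a": {"b": 3, "c": 1},
    "b": {"a": 3},
    "c": {"a": 1, "d": 4},
    "d": {"c": 4},
    "e": {},
}
clusters = cluster_nodes(graph)
```

Here the weak `a`-`c` edge is ignored, so `a` and `b` form one cluster, `c` and `d` another, and `e` stays alone. A denser-subgraph criterion could replace the threshold without changing the surrounding flow.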
19 Claims
1. A computer-implemented method of identifying documents sharing a common underlying structure, comprising:
detecting occurrences of a plurality of predetermined image features in a plurality of document images, wherein at least one of the plurality of predetermined image features is common among instances of a form;
indexing the plurality of document images in an image index based on the detected image features;
building a graph of connected nodes for the plurality of document images by searching the image index;
identifying the documents sharing the common underlying structure using the graph;
reproducing the common underlying structure shared by the identified documents; and
generating improved images of the identified documents by overlaying the reproduced common underlying structure on document images of the identified documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
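The indexing and graph-building steps of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: feature detection is replaced by hypothetical pre-extracted feature tokens, the image index is modeled as a plain inverted index, and two images are connected when they share at least `min_shared` features. None of these names or thresholds come from the patent.

```python
from collections import defaultdict


def build_index(features_by_image):
    """Inverted index: feature token -> set of image ids containing it."""
    index = defaultdict(set)
    for image_id, features in features_by_image.items():
        for feature in features:
            index[feature].add(image_id)
    return index


def build_graph(features_by_image, index, min_shared=2):
    """Connect two images when they share at least min_shared features."""
    graph = {image_id: {} for image_id in features_by_image}
    for image_id, features in features_by_image.items():
        shared = defaultdict(int)
        # Search the index with every feature detected in this image.
        for feature in features:
            for other in index[feature]:
                if other != image_id:
                    shared[other] += 1
        for other, count in shared.items():
            if count >= min_shared:
                graph[image_id][other] = count
    return graph


# Hypothetical pre-extracted features; a real system would detect them in pixels.
features_by_image = {
    "scan_a": {"logo", "box_grid", "header_line"},
    "scan_b": {"logo", "box_grid", "signature_line"},
    "scan_c": {"letterhead", "paragraph_block"},
}
index = build_index(features_by_image)
graph = build_graph(features_by_image, index)
```

In this toy input, `scan_a` and `scan_b` share two features and are linked with weight 2, while `scan_c` has no neighbors. Searching the index avoids comparing every pair of images directly, which matters for a large collection.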
11. A computer system for identifying documents sharing a common underlying structure, comprising:
a non-transitory computer-readable storage medium comprising executable computer program code for:
detecting occurrences of a plurality of predetermined image features in a plurality of document images, wherein at least one of the plurality of predetermined image features is common among instances of a form;
indexing the plurality of document images in an image index based on the detected image features;
building a graph of connected nodes for the plurality of document images by searching the image index;
identifying the documents sharing the common underlying structure using the graph;
reproducing the common underlying structure shared by the identified documents; and
generating improved images of the identified documents by overlaying the reproduced common underlying structure on document images of the identified documents; and
a processor for executing the computer program code.
Dependent claims: 12, 13, 14, 15
16. A non-transitory computer-readable storage medium storing executable computer program instructions for identifying documents sharing a common underlying structure, the computer program instructions comprising instructions for:
detecting occurrences of a plurality of predetermined image features in a plurality of document images, wherein at least one of the plurality of predetermined image features is common among instances of a form;
indexing the plurality of document images in an image index based on the detected image features;
building a graph of connected nodes for the plurality of document images by searching the image index, wherein nodes representing instances of a predefined document type are connected by edges in the graph;
identifying the documents sharing the common underlying structure using the graph;
reproducing the common underlying structure shared by the identified documents; and
generating improved images of the identified documents by overlaying the reproduced common underlying structure on document images of the identified documents.
Dependent claims: 17, 18, 19
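The reproducing and overlaying steps recited in the independent claims can be illustrated with a small sketch. It assumes the clustered images are already aligned, equally sized, and binarized (1 for ink, 0 for background). The shared structure is estimated by a per-pixel majority vote and overlaid with a pixel-wise maximum. The vote, the `min_fraction` parameter, and both function names are assumptions for illustration; the claims do not prescribe them.

```python
def reproduce_structure(images, min_fraction=0.5):
    """Keep a pixel when it is ink in more than min_fraction of the images."""
    height, width = len(images[0]), len(images[0][0])
    needed = min_fraction * len(images)
    return [
        [1 if sum(img[y][x] for img in images) > needed else 0
         for x in range(width)]
        for y in range(height)
    ]


def overlay(image, structure):
    """Pixel-wise maximum: restores structure pixels missing from the image."""
    return [
        [max(pixel, s) for pixel, s in zip(row, s_row)]
        for row, s_row in zip(image, structure)
    ]


# Three toy scans of one form: a ruled line in row 0, varying content in row 1.
img1 = [[1, 1, 1, 1], [0, 1, 0, 0]]
img2 = [[1, 1, 0, 1], [0, 0, 1, 0]]  # the ruled line has a gap
img3 = [[1, 0, 1, 1], [1, 0, 0, 0]]  # the ruled line has a different gap

structure = reproduce_structure([img1, img2, img3])
improved = overlay(img2, structure)
```

The vote keeps the ruled line, which appears in most scans, and discards the filled-in content, which differs from scan to scan. Overlaying the result on `img2` closes the gap in its line while leaving its own content in place.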
Specification