DOCUMENT CLUSTERING AND RECONSTRUCTION
First Claim
1. A method comprising:
receiving a plurality of documents;
for each of the plurality of documents, identifying a plurality of objects and locations of each of the plurality of objects;
determining occurrences of similar objects in the identified locations of the plurality of objects between the plurality of documents;
applying a document sorting algorithm to generate a score for each of the plurality of documents, wherein the score for each of the plurality of documents is generated based on a number of occurrences of similar objects between the plurality of documents; and
comparing the generated score of each of the plurality of documents to identify a template document.
Abstract
A scanner scans a group of documents. For example, the documents can be a group of invoices. The documents are received and processed. Objects (e.g., a text object, such as a word) and their locations are identified in each of the documents. Occurrences of similar objects in the identified locations between the documents are determined. A document sorting algorithm is applied to generate a score for each of the documents. The score for each of the documents is generated based on a number of occurrences of similar objects between the documents. The generated score of each of the documents is used to identify a template document. The template document is then used to cluster the documents.
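The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: each document is assumed to be already reduced to a mapping from a location (here a hypothetical grid cell) to the text object found there, "similar" is simplified to exact string equality, and the score of a document is the number of times its (location, object) pairs recur in the other documents. The names `score_documents`, `pick_template`, `cluster_by_template`, and the `min_shared` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Location = Tuple[int, int]      # hypothetical (row, column) grid cell
Document = Dict[Location, str]  # location -> text object found there


def score_documents(docs: List[Document]) -> List[int]:
    """Score each document by how often its (location, object) pairs
    occur in the other documents (assumption: similar == equal)."""
    pair_counts = Counter(pair for doc in docs for pair in doc.items())
    # Subtract 1 so a document does not count its own occurrence.
    return [sum(pair_counts[pair] - 1 for pair in doc.items()) for doc in docs]


def pick_template(docs: List[Document]) -> int:
    """Compare the scores and return the index of the highest-scoring
    document (the first one wins on a tie)."""
    scores = score_documents(docs)
    return max(range(len(docs)), key=scores.__getitem__)


def cluster_by_template(docs: List[Document], template_idx: int,
                        min_shared: int = 2) -> List[int]:
    """Return indices of documents sharing at least `min_shared`
    (location, object) pairs with the template document."""
    template = docs[template_idx]
    members = []
    for i, doc in enumerate(docs):
        shared = sum(1 for loc, obj in doc.items() if template.get(loc) == obj)
        if shared >= min_shared:
            members.append(i)
    return members


if __name__ == "__main__":
    invoices = [
        {(0, 0): "Invoice", (0, 1): "Date", (1, 0): "Total"},
        {(0, 0): "Invoice", (0, 1): "Date", (1, 0): "Tota1"},  # simulated OCR error
        {(0, 0): "Invoice", (0, 1): "Date", (1, 0): "Total"},
        {(0, 0): "Memo", (2, 2): "Regards"},
    ]
    template_idx = pick_template(invoices)
    print(score_documents(invoices), template_idx,
          cluster_by_template(invoices, template_idx))
```

In this toy data set the unrelated memo shares no (location, object) pair with the invoices, so it scores zero and falls outside the cluster. A real system would likely need a tolerance for location and a fuzzier notion of object similarity, neither of which is specified here.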
20 Claims
1. A method comprising:
receiving a plurality of documents;
for each of the plurality of documents, identifying a plurality of objects and locations of each of the plurality of objects;
determining occurrences of similar objects in the identified locations of the plurality of objects between the plurality of documents;
applying a document sorting algorithm to generate a score for each of the plurality of documents, wherein the score for each of the plurality of documents is generated based on a number of occurrences of similar objects between the plurality of documents; and
comparing the generated score of each of the plurality of documents to identify a template document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system comprising:
a document processor configured to receive a plurality of documents,
for each of the plurality of documents, identify a plurality of objects and locations of each of the plurality of objects,
determine occurrences of similar objects in the identified locations of the plurality of objects between the plurality of documents, and
apply a document sorting algorithm to generate a score for each of the plurality of documents, wherein the score for each of the plurality of documents is generated based on a number of occurrences of similar objects between the plurality of documents; and
a document classifier configured to compare the generated score of each of the plurality of documents to identify a template document.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A system comprising:
a scanner configured to scan a plurality of documents;
a document processor configured to receive the plurality of documents,
for each of the plurality of documents, identify a plurality of objects and locations of each of the plurality of objects,
determine occurrences of similar objects in the identified locations of the plurality of objects between the plurality of documents,
apply a document sorting algorithm to generate a score for each of the plurality of documents, wherein the score for each of the plurality of documents is generated based on a number of occurrences of similar objects between the plurality of documents,
determine an amount of certainty for an occurrence of similar objects in a common object document location between the plurality of documents,
identify the common object document location based on a minimum certainty threshold value,
determine that the template document contains an error for an individual object in the common object document location in the template document, and
replace the individual object in the common object document location in the template document with a second object in response to determining that the template document contains an error for the individual object in the common object document location in the template document, wherein the second object is from the common object location in a second one of the plurality of documents that has been determined to be correct; and
a document classifier configured to compare the generated score of each of the plurality of documents to identify a template document and cluster the plurality of documents based on the template document.
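The certainty and error-replacement steps recited in claim 20 can be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes certainty is the fraction of documents that agree on the most frequent object at a location, and it approximates "determined to be correct" by taking that majority object. The names `common_locations`, `repair_template`, and `min_certainty` are hypothetical, as is the representation of a document as a mapping from grid cell to text object.

```python
from collections import Counter
from typing import Dict, List, Tuple

Location = Tuple[int, int]      # hypothetical (row, column) grid cell
Document = Dict[Location, str]  # location -> text object found there


def common_locations(docs: List[Document],
                     min_certainty: float) -> Dict[Location, Tuple[str, float]]:
    """Map each location whose most frequent object reaches `min_certainty`
    to (that object, its certainty). Certainty is assumed to be the share
    of all documents that contain the object at that location."""
    result = {}
    all_locations = set().union(*(doc.keys() for doc in docs))
    for loc in all_locations:
        counts = Counter(doc[loc] for doc in docs if loc in doc)
        obj, n = counts.most_common(1)[0]
        certainty = n / len(docs)
        if certainty >= min_certainty:
            result[loc] = (obj, certainty)
    return result


def repair_template(template: Document, docs: List[Document],
                    min_certainty: float = 0.6) -> Document:
    """Return a copy of the template in which any object that disagrees
    with the majority object at a common location is replaced by it."""
    repaired = dict(template)  # leave the original template untouched
    for loc, (obj, _certainty) in common_locations(docs, min_certainty).items():
        if loc in repaired and repaired[loc] != obj:
            repaired[loc] = obj  # treat the template's object as an error
    return repaired


if __name__ == "__main__":
    invoices = [
        {(0, 0): "Invoice", (1, 0): "Tota1"},  # template with a simulated OCR error
        {(0, 0): "Invoice", (1, 0): "Total"},
        {(0, 0): "Invoice", (1, 0): "Total"},
    ]
    print(repair_template(invoices[0], invoices))
```

With the default threshold of 0.6, the location `(1, 0)` qualifies as a common location because two of the three documents agree on "Total", so the template's "Tota1" is replaced. Raising the threshold above two thirds would leave the template unchanged at that location.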
Specification