Method and system for clustering identified forms
Abstract
A method is provided for organizing a plurality of documents that include forms. An initial set of clusters is defined for the plurality of documents. The initial set of clusters is reclustered based on similarity values calculated in multiple feature spaces. For example, a first feature space may be associated with a content of a document while a second feature space may be associated with a content of a form associated with the document. Each cluster has an associated centroid vector in each feature space that is used to represent the cluster. The similarity between the document and each cluster is calculated in both feature spaces. Each document is assigned to the cluster whose centroid is most similar. The cluster centroids may be recalculated and the process repeated until the cluster assignments become stable.
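The reclustering pass described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name a similarity measure or say how the two per-space similarities are combined, so cosine similarity, the weighted sum, and the `weight` parameter are assumptions, as are all function names and the dictionary keys `"content"` and `"form"`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def combined_similarity(doc, cluster, weight=0.5):
    """Blend the similarities from the two feature spaces.

    Assumption: a weighted sum; the source only says the overall value is
    based on both per-space similarity values.
    """
    s_content = cosine(doc["content"], cluster["content"])  # document feature space
    s_form = cosine(doc["form"], cluster["form"])            # form feature space
    return weight * s_content + (1.0 - weight) * s_form


def assign(docs, clusters, weight=0.5):
    """Return, for each document, the index of its most similar cluster."""
    return [
        max(range(len(clusters)),
            key=lambda k: combined_similarity(doc, clusters[k], weight))
        for doc in docs
    ]


def recompute_centroids(docs, labels, clusters):
    """Recompute both centroid vectors per cluster; an empty cluster keeps its old ones."""
    updated = []
    for k, old in enumerate(clusters):
        members = [doc for doc, label in zip(docs, labels) if label == k]
        if members:
            updated.append({
                "content": centroid([m["content"] for m in members]),
                "form": centroid([m["form"] for m in members]),
            })
        else:
            updated.append(old)
    return updated
```

Calling `assign` and then `recompute_centroids` corresponds to one pass of the process; the abstract describes repeating such passes until the cluster assignments stop changing.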
41 Citations
32 Claims
1. A device for organizing a plurality of documents that include forms, the device comprising:
a processor;
a non-transitory computer-readable medium operably coupled to the processor, the computer-readable medium comprising instructions that, upon execution by the processor, perform operations comprising:
(a) identifying a first form in a first document selected from a plurality of documents;
(b) calculating a first similarity value in a first feature space between a first cluster selected from a plurality of clusters defined for the plurality of documents and the first document, the first feature space associated with a content of the first document and a content of a document assigned to the first cluster;
(c) calculating a second similarity value in a second feature space between the first cluster and the first document, the second feature space associated with a content of the identified first form and a content of a form in the document assigned to the first cluster;
(d) calculating a similarity value between the first document and the first cluster based on the calculated first similarity value and the calculated second similarity value;
(e) repeating (b)-(d) with each of the plurality of clusters as the first cluster selected from the plurality of clusters;
(f) determining a cluster of the plurality of clusters to which to assign the first document based on the calculated similarity value for each of the plurality of clusters; and
repeat (a)-(f) for each of the plurality of documents as the first document selected from the plurality of documents until the assignments become stable, wherein determining if the assignments become stable includes calculating a number of documents of the plurality of documents that are assigned to a different cluster in (f).
2. A non-transitory computer-readable medium comprising computer-readable instructions therein that, upon execution by a processor, cause the processor to organize a plurality of documents that include forms, the instructions configured to cause a computing device to:
(a) identify a first form in a first document selected from a plurality of documents;
(b) calculate a first similarity value in a first feature space between a first cluster selected from a plurality of clusters defined for the plurality of documents and the first document, the first feature space associated with a content of the first document and a content of a document assigned to the first cluster;
(c) calculate a second similarity value in a second feature space between the first cluster and the first document, the second feature space associated with a content of the identified first form and a content of a form in the document assigned to the first cluster;
(d) calculate a similarity value between the first document and the first cluster based on the calculated first similarity value and the calculated second similarity value;
(e) repeat (b)-(d) with each of the plurality of clusters as the first cluster selected from the plurality of clusters;
(f) determine a cluster of the plurality of clusters to which to assign the first document based on the calculated similarity value for each of the plurality of clusters; and
repeat (a)-(f) for each of the plurality of documents as the first document selected from the plurality of documents until the assignments become stable, wherein determining if the assignments become stable includes calculating a number of documents of the plurality of documents that are assigned to a different cluster in (f).
3. A method of organizing a plurality of documents that include forms, the method comprising:
(a) identifying, by a computing device, a first form in a first document selected from a plurality of documents;
(b) calculating, by the computing device, a first similarity value in a first feature space between a first cluster selected from a plurality of clusters defined for the plurality of documents and the first document, the first feature space associated with a content of the first document and a content of a document assigned to the first cluster;
(c) calculating, by the computing device, a second similarity value in a second feature space between the first cluster and the first document, the second feature space associated with a content of the identified first form and a content of a form in the document assigned to the first cluster;
(d) calculating, by the computing device, a similarity value between the first document and the first cluster based on the calculated first similarity value and the calculated second similarity value;
(e) repeating (b)-(d), by the computing device, with each of the plurality of clusters as the first cluster selected from the plurality of clusters;
(f) determining, by the computing device, a cluster of the plurality of clusters to which to assign the first document based on the calculated similarity value for each of the plurality of clusters; and
repeating (a)-(f) for each of the plurality of documents as the first document selected from the plurality of documents until the assignments become stable, wherein determining if the assignments become stable includes calculating a number of documents of the plurality of documents that are assigned to a different cluster in (f).
Dependent claims: 4-32
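The stopping condition recited in the independent claims, counting how many documents were assigned to a different cluster in step (f), can be illustrated with a short sketch. It rests on assumptions the claims do not state: the function name `reassign_until_stable`, the `similarity` callback (standing in for the combined value of step (d)), the `max_changed` threshold, the `max_iters` safety cap, and the choice to reassign all documents in a batch before counting changes are all hypothetical.

```python
from typing import Callable, List, Sequence


def reassign_until_stable(
    n_docs: int,
    n_clusters: int,
    labels: Sequence[int],
    similarity: Callable[[int, int, Sequence[int]], float],
    max_changed: int = 0,
    max_iters: int = 100,
) -> List[int]:
    """Repeat the assignment step until at most `max_changed` documents switch clusters.

    `similarity(d, k, labels)` is a hypothetical callback returning the combined
    similarity between document d and cluster k under the current assignment.
    """
    current = list(labels)
    for _ in range(max_iters):
        # Steps (b)-(f): score every cluster for every document, keep the best.
        proposed = [
            max(range(n_clusters), key=lambda k: similarity(d, k, current))
            for d in range(n_docs)
        ]
        # Stability check: number of documents assigned to a different cluster.
        changed = sum(1 for old, new in zip(current, proposed) if old != new)
        current = proposed
        if changed <= max_changed:
            break
    return current
```

With `max_changed=0` the loop ends only when a full pass moves no document; a larger threshold would stop earlier, at the cost of leaving a few assignments unsettled.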
Specification