Method and system for using keywords to merge document clusters
Abstract
Using keywords to merge document clusters is described. Documents are distributed into document clusters that include a first document cluster of first documents and a second document cluster of second documents. A template associated with the first document cluster is created. The template includes keywords associated with most of the first documents. A distance is calculated between keyword location information associated with the template and word location information associated with a document in the second document cluster. The keyword location information includes information indicating a location of a keyword in the template relative to other keywords in the template. The word location information includes information indicating a location of a word in the document relative to other words in the document. A determination is made whether the distance is less than a threshold value. The second document cluster is merged with the first document cluster in response to the determination that the distance is less than the threshold value.
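The steps in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the patented implementation: documents are plain strings split on whitespace, "most" means more than half of the cluster, location information is modeled as keyword order, the distance is a normalized Levenshtein edit distance over keyword sequences, and the first document of the second cluster is the one compared. The function names and the 0.3 threshold are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    return text.lower().split()


def build_template(cluster, min_fraction=0.5):
    """Return the words found in more than min_fraction of the cluster's
    documents, ordered by their mean first position in those documents."""
    token_lists = [tokenize(doc) for doc in cluster]
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    keywords = [w for w, n in doc_freq.items() if n / len(cluster) > min_fraction]

    def mean_first_position(word):
        positions = [tokens.index(word) for tokens in token_lists if word in tokens]
        return sum(positions) / len(positions)

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(keywords, key=lambda w: (mean_first_position(w), w))


def edit_distance(a, b):
    """Levenshtein distance between two sequences of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def template_distance(template, document):
    """Compare the keyword order in the template with the order in which
    those keywords first appear in the document, scaled by template length."""
    keyword_set = set(template)
    seen = []
    for word in tokenize(document):
        if word in keyword_set and word not in seen:
            seen.append(word)
    return edit_distance(template, seen) / max(len(template), 1)


def merge_if_close(first_cluster, second_cluster, threshold=0.3):
    """Merge the second cluster into the first when a sample document from
    the second cluster is closer to the template than the threshold."""
    template = build_template(first_cluster)
    sample = second_cluster[0]  # assumption: any one document may be used
    if template_distance(template, sample) < threshold:
        return [first_cluster + second_cluster]
    return [first_cluster, second_cluster]


invoices = [
    "invoice number 123 total due 50",
    "invoice number 456 total due 75",
    "invoice number 789 total due 20",
]
clusters = merge_if_close(invoices, ["invoice number 999 total due 10"])
print(len(clusters))
```

In this toy data the second cluster's document has the same keywords in the same relative order as the template, so the distance is zero and the two clusters are merged. A document with the keywords in a different order yields a larger distance and leaves the clusters separate.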
20 Claims
1. A system for using keywords to merge document clusters, the system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
distribute a plurality of documents into a plurality of document clusters, wherein the plurality of document clusters comprise a first document cluster comprising a first plurality of documents and a second document cluster comprising a second plurality of documents;
create a template associated with the first document cluster, wherein the template comprises a plurality of keywords associated with at least most of the first plurality of documents;
calculate a distance between keyword location information associated with the template and word location information associated with a document in the second document cluster, wherein the keyword location information comprises information indicating a location of a keyword in the template relative to other keywords in the template, and wherein the word location information comprises information indicating a location of a word in the document relative to other words in the document;
determine whether the distance is less than a threshold value; and
merge the second document cluster with the first document cluster in response to a determination that the distance is less than the threshold value.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer-implemented method for using keywords to merge document clusters, the method comprising:
distributing a plurality of documents into a plurality of document clusters, wherein the plurality of document clusters comprise a first document cluster comprising a first plurality of documents and a second document cluster comprising a second plurality of documents;
creating a template associated with the first document cluster, wherein the template comprises a plurality of keywords associated with at least most of the first plurality of documents;
calculating a distance between keyword location information associated with the template and word location information associated with a document in the second document cluster, wherein the keyword location information comprises information indicating a location of a keyword in the template relative to other keywords in the template, and wherein the word location information comprises information indicating a location of a word in the document relative to other words in the document;
determining whether the distance is less than a threshold value; and
merging the second document cluster with the first document cluster in response to a determination that the distance is less than the threshold value.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer program product, comprising a non-transitory computer-readable medium having a computer-readable program code embodied therein, the computer-readable program code adapted to be executed by one or more processors, the program code comprising instructions to:
distribute a plurality of documents into a plurality of document clusters, wherein the plurality of document clusters comprise a first document cluster comprising a first plurality of documents and a second document cluster comprising a second plurality of documents;
create a template associated with the first document cluster, wherein the template comprises a plurality of keywords associated with at least most of the first plurality of documents;
calculate a distance between keyword location information associated with the template and word location information associated with a document in the second document cluster, wherein the keyword location information comprises information indicating a location of a keyword in the template relative to other keywords in the template, and wherein the word location information comprises information indicating a location of a word in the document relative to other words in the document;
determine whether the distance is less than a threshold value; and
merge the second document cluster with the first document cluster in response to a determination that the distance is less than the threshold value.
Dependent claims: 16, 17, 18, 19, 20
Specification