Lightweight document clustering
First Claim
Patent Images
1. A method for clustering similar documents together comprising the steps of:
- a) creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;
b) creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;
c) selecting a document in the document list;
d) determining a specified number of documents that are most similar to the document selected in step c), wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);
e) repeating steps c) and d) for each document in the document list;
f) arranging the specified documents determined in each iteration of step d) into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster;
j) assigning each document to a cluster; and
k) determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document;
for each match pair;
i) if the match score is less than a threshold minimum score, proceed to the next match pair;
ii) if the match pair is already in the same cluster, proceed to the next match pair;
iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;
iv) if the first document and the second document are in separate clusters;
if option is no merging, go to the next match pair;
if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;
or merging the two separate clusters into one cluster, and go to the next match pair;
1) repeating step k) for each document in the document list.
1 Assignment
0 Petitions
Accused Products
Abstract
A procedure for clustering documents that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters or, by varying a single parameter, into a few dozen clusters. The procedure is specified in two parts: computing a similarity score representing the k most similar documents (typically the top ten) for each document in the collection, and grouping the documents into clusters using the similarly scores.
-
Citations
7 Claims
-
1. A method for clustering similar documents together comprising the steps of:
-
a) creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;
b) creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;
c) selecting a document in the document list;
d) determining a specified number of documents that are most similar to the document selected in step c), wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);
e) repeating steps c) and d) for each document in the document list;
f) arranging the specified documents determined in each iteration of step d) into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster;
j) assigning each document to a cluster; and
k) determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document;
for each match pair;
i) if the match score is less than a threshold minimum score, proceed to the next match pair;
ii) if the match pair is already in the same cluster, proceed to the next match pair;
iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;
iv) if the first document and the second document are in separate clusters;
if option is no merging, go to the next match pair;
if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;
ormerging the two separate clusters into one cluster, and go to the next match pair;
1) repeating step k) for each document in the document list. - View Dependent Claims (2, 3, 4)
g) for each keyword in the document selected in step c), accessing the wordlist to calculate a match score for each document referenced therein, wherein the match score is a summation of the total number of times that each keyword appears plus a bonus, which is the IDF;
h) ranking each document in the document list according to the respective match score of each document, and i) selecting a specified number of documents having the highest match score.
-
-
3. The method recited in claim 1, wherein said linear list of a subset of keywords for each document is comprised of the k most frequent words in the document.
-
4. The method recited in claim 1, wherein said linear list of a subset of keywords for each document is comprised of the k most predictive words in the document.
-
5. A computer program product comprising a computer usable medium having computer readable program code embodied in the medium for objectively grouping similar documents into clusters, the computer program product having:
-
a) first computer program code for creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;
b) second computer program code creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;
c) third computer program code selecting a document in the document list;
d) fourth computer program code determining a specified number of documents that are most similar to the document selected by the third computer program code, wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);
e) fifth computer program code for executing the third and fourth computer program code for each document in the document list;
f) sixth computer program code arranging the specified documents determined in each execution of the fourth computer program code into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster, wherein the fourth computer program product comprises g) seventh computer program code that, for each keyword determined by the third computer program code, accesses the wordlist to calculate a match score for each document referenced therein, wherein the match score is a summation of the total number of times that each keyword appears plus a bonus, which is the IDF;
h) eighth computer program code for ranking each document in the document list according to the respective match score of each document;
i) ninth computer program code for selecting a specified number of documents having the highest match score;
wherein said sixth computer program product comprises; i) tenth computer program code for assigning each document to a cluster;
k) eleventh computer program code for determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document, such that for each match pair;
i) if the match score is less than a threshold minimum score, proceed to the next match pair;
ii) if the match pair is already in the same cluster, proceed to the next match pair;
iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;
iv) if the first document and the second document are is separate clusters;
if option is no merging, go to the next match pair;
if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;
ormerging the two separate clusters into one cluster, and go to the next match pair;
1) twelfth computer program code for re-executing the eleventh computer code for each document in the document list. - View Dependent Claims (6, 7)
-
Specification