Lightweight document clustering

US 6,654,739 B1
Filed: 01/31/2000
Issued: 11/25/2003
Est. Priority Date: 01/31/2000
Status: Expired due to Fees

First Claim

Patent Images

1. A method for clustering similar documents together comprising the steps of:

a) creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;

b) creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;

c) selecting a document in the document list;

d) determining a specified number of documents that are most similar to the document selected in step c), wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);

e) repeating steps c) and d) for each document in the document list;

f) arranging the specified documents determined in each iteration of step d) into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster;

j) assigning each document to a cluster; and

k) determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document;

for each match pair;

i) if the match score is less than a threshold minimum score, proceed to the next match pair;

ii) if the match pair is already in the same cluster, proceed to the next match pair;

iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;

iv) if the first document and the second document are in separate clusters;

if option is no merging, go to the next match pair;

if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;

or merging the two separate clusters into one cluster, and go to the next match pair;

1) repeating step k) for each document in the document list.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A procedure for clustering documents that operates in high dimensions, processes tens of thousands of documents and groups them into several thousand clusters or, by varying a single parameter, into a few dozen clusters. The procedure is specified in two parts: computing a similarity score representing the k most similar documents (typically the top ten) for each document in the collection, and grouping the documents into clusters using the similarly scores.

Citations

7 Claims

1. A method for clustering similar documents together comprising the steps of:
- a) creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;
  
  b) creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;
  
  c) selecting a document in the document list;
  
  d) determining a specified number of documents that are most similar to the document selected in step c), wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);
  
  e) repeating steps c) and d) for each document in the document list;
  
  f) arranging the specified documents determined in each iteration of step d) into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster;
  
  j) assigning each document to a cluster; and
  
  k) determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document;
  
  for each match pair;
  
  i) if the match score is less than a threshold minimum score, proceed to the next match pair;
  
  ii) if the match pair is already in the same cluster, proceed to the next match pair;
  
  iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;
  
  iv) if the first document and the second document are in separate clusters;
  
  if option is no merging, go to the next match pair;
  
  if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;
  
  or merging the two separate clusters into one cluster, and go to the next match pair;
  
  1) repeating step k) for each document in the document list.
- View Dependent Claims (2, 3, 4)
- - 2. The method as recited in claim 1, wherein the determining step comprises the steps of:
3. The method recited in claim 1, wherein said linear list of a subset of keywords for each document is comprised of the k most frequent words in the document.
4. The method recited in claim 1, wherein said linear list of a subset of keywords for each document is comprised of the k most predictive words in the document.

5. A computer program product comprising a computer usable medium having computer readable program code embodied in the medium for objectively grouping similar documents into clusters, the computer program product having:
- a) first computer program code for creating a document list containing for each document a linear list of a subset of keywords that appear in said each document;
  
  b) second computer program code creating a wordlist that contains a linear list of all the documents that contain on their respective linear list of a subset of keywords a particular keyword;
  
  c) third computer program code selecting a document in the document list;
  
  d) fourth computer program code determining a specified number of documents that are most similar to the document selected by the third computer program code, wherein the similarity is based upon a summation of the total number of times that each keyword appears in the documents not selected in step c) plus a bonus, which is the inverse document frequency (IDF);
  
  e) fifth computer program code for executing the third and fourth computer program code for each document in the document list;
  
  f) sixth computer program code arranging the specified documents determined in each execution of the fourth computer program code into a plurality of clusters wherein each document within a cluster has at least one same common keyword with all other documents in the cluster, wherein the fourth computer program product comprisesg) seventh computer program code that, for each keyword determined by the third computer program code, accesses the wordlist to calculate a match score for each document referenced therein, wherein the match score is a summation of the total number of times that each keyword appears plus a bonus, which is the IDF;
  
  h) eighth computer program code for ranking each document in the document list according to the respective match score of each document;
  
  i) ninth computer program code for selecting a specified number of documents having the highest match score;
  
  wherein said sixth computer program product comprises;
  
  i) tenth computer program code for assigning each document to a cluster;
  
  k) eleventh computer program code for determining a number of match pairs for a first document, wherein a number of match pairs is the number of second documents that are on the top k matched list for that document, such that for each match pair;
  
  i) if the match score is less than a threshold minimum score, proceed to the next match pair;
  
  ii) if the match pair is already in the same cluster, proceed to the next match pair;
  
  iii) if the first document is in a cluster, and the second document is not, add the second document to the cluster that the first document is in, and proceed to the next match pair;
  
  iv) if the first document and the second document are is separate clusters;
  
  if option is no merging, go to the next match pair;
  
  if option is repeat documents, replicate the second document in all clusters that the first document is in, and go to the next match pair;
  
  or merging the two separate clusters into one cluster, and go to the next match pair;
  
  1) twelfth computer program code for re-executing the eleventh computer code for each document in the document list.
- View Dependent Claims (6, 7)
- - 6. The computer program recited in claim 5, wherein said linear list of a subset of keywords for each document is comprised of the k most frequent words in the document.
  - 7. The computer program recited in claim 5, wherein said linear list of a subset of keywords for each document is comprised of the k most predictive words in the document.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Weiss, Sholom M., Apte, Chidanand, White, Brian F.
Primary Examiner(s)
Alam, Shahid Al
Assistant Examiner(s)
WOO, ISAAC M

Application Number

US09/494,744
Time in Patent Office

1,394 Days
Field of Search

707/1, 707/200, 707/2, 707/205, 707/3, 707/4, 707/5, 707/100, 707/101, 707/103, 707/102, 707/500, 707/501, 707/502
US Class Current

1/1
CPC Class Codes

G06F 16/355   Class or cluster creation o...

G06F 2216/03   Data mining

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99953   Recoverability

Lightweight document clustering

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

7 Claims

Specification

Solutions

Use Cases

Quick Links

Lightweight document clustering

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

7 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links