Methods and apparatus for probe-based clustering

US 20070112898A1
Filed: 11/15/2005
Published: 05/17/2007
Est. Priority Date: 11/15/2005
Status: Abandoned Application

Abstract

A method for identifying clusters of similar documents from among a set of documents is described. A particular document is selected from among available documents of the set of documents, and a probe is generated based on the particular document. The probe comprises one or more features. Documents are found that satisfy a similarity condition using the probe from among the available documents. Some or all of the documents that satisfy the similarity condition are associated with a particular cluster of documents. The process can be repeated to generate further clusters. The method can be implemented with a computer, and associated programming instructions can be contained within a computer readable carrier.
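
The loop described in the abstract can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the applicants' implementation: the whitespace tokenizer, the use of the `k` most frequent terms as probe features, the overlap-fraction similarity test, the halting rule, and the names `make_probe`, `satisfies`, `probe_cluster`, `threshold` and `max_clusters` are all hypothetical choices made here for illustration.

```python
import random
from collections import Counter


def make_probe(text, k=3):
    """Build a probe from one document: its k most frequent terms (assumed feature choice)."""
    counts = Counter(text.lower().split())
    return {term for term, _ in counts.most_common(k)}


def satisfies(probe, text, threshold):
    """Assumed similarity condition: fraction of probe features present in the document."""
    if not probe:
        return False
    terms = set(text.lower().split())
    return len(probe & terms) / len(probe) >= threshold


def probe_cluster(docs, threshold=0.5, max_clusters=10, seed=0):
    """Return a list of clusters, each a list of indices into docs."""
    rng = random.Random(seed)
    available = set(range(len(docs)))
    clusters = []
    # Assumed halting condition: nothing left to cluster, or max_clusters reached.
    while available and len(clusters) < max_clusters:
        seed_doc = rng.choice(sorted(available))      # step (a): select a document
        probe = make_probe(docs[seed_doc])            # step (b): generate a probe
        hits = [i for i in sorted(available)          # step (c): find similar documents
                if satisfies(probe, docs[i], threshold)]
        if seed_doc not in hits:                      # e.g. an empty document
            hits.append(seed_doc)
        clusters.append(hits)                         # step (d): associate with a cluster
        available -= set(hits)                        # clustered documents leave the pool
    return clusters
```

Removing the hits from `available` after each pass is what keeps previously clustered documents out of later passes, so each document ends up in at most one cluster.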

23 Claims

1. A method for identifying clusters of similar documents from among a set of documents, the method comprising:
- (a) selecting a particular document from among available documents of the set of documents;
- (b) generating a probe based on the particular document, the probe comprising one or more features;
- (c) finding documents that satisfy a similarity condition using the probe from among the available documents;
- (d) associating some or all of the documents that satisfy the similarity condition with a particular cluster of documents;
- (e) repeating steps (a)-(d) using another probe as the probe and using another similarity condition as the similarity condition until a halting condition is satisfied to identify at least one other cluster of documents, wherein those documents of the set of documents previously associated with a cluster of documents are not included among the available documents.
2. The method of claim 1, wherein said another similarity condition is the same as the similarity condition.
3. The method of claim 1, wherein the probe comprises the particular document.
4. The method of claim 1, wherein the probe comprises a subset of features selected from the particular document.
5. The method of claim 1, wherein the probe comprises a subset of features selected from multiple documents of the set of documents, and wherein the subset of features includes features of the particular document.
6. The method of claim 1, wherein the particular document is selected randomly from among the set of documents.
7. The method of claim 1, comprising ranking the documents of said particular cluster and ranking the documents of said at least one other cluster.
8. The method of claim 1, comprising providing an identifier that describes content of the particular cluster of documents.
9. The method of claim 1, comprising refining the probe by reforming the probe using at least one new document from the set of documents.
10. The method of claim 1, comprising:
- obtaining a set of candidate documents from among the set of documents to be documents from which to form probes including said probe, wherein selecting the particular document in step (a) comprises selecting from the set of candidate documents.
11. The method of claim 10, comprising:
- updating the set of candidate documents by removing from the set of candidate documents any documents identified to be associated with a cluster of documents.
13. The apparatus of claim 1, wherein said another similarity condition is the same as the similarity condition.
14. The apparatus of claim 1, wherein the probe comprises the particular document.
15. The apparatus of claim 1, wherein the probe comprises a subset of features selected from the particular document.
16. The apparatus of claim 1, wherein the probe comprises a subset of features selected from multiple documents of the set of documents, and wherein the subset of features includes features of the particular document.
17. The apparatus of claim 1, wherein the particular document is selected randomly from among the set of documents.
18. The apparatus of claim 1, wherein the processor is configured to rank the documents of said particular cluster and rank the documents of said at least one other cluster.
19. The apparatus of claim 1, wherein the processor is configured to provide an identifier that describes content of the particular cluster of documents.
20. The apparatus of claim 1, wherein the processor is configured to refine the probe by reforming the probe using at least some new documents from the set of documents.
21. The apparatus of claim 1, wherein the processor is configured to:
- obtain a set of candidate documents from among the set of documents to be documents from which to form probes including said probe, wherein selecting the particular document in step (a) comprises selecting from the set of candidate documents.
22. The apparatus of claim 21, wherein the processor is configured to update the set of candidate documents by removing from the set of candidate documents any documents identified to be associated with a cluster of documents.
23. A computer readable carrier comprising processing instructions adapted to cause a processor to execute the method of claim 1.

12. An apparatus for identifying clusters of similar documents from among a set of documents, comprising:
- a memory; and
- a processor coupled to the memory, wherein the processor is configured to execute the steps of:
  - (a) selecting a particular document from among available documents of the set of documents;
  - (b) generating a probe based on the particular document, the probe comprising one or more features;
  - (c) finding documents that satisfy a similarity condition using the probe from among the available documents;
  - (d) associating some or all of the documents that satisfy the similarity condition with a particular cluster of documents;
  - (e) repeating steps (a)-(d) using another probe as the probe and using another similarity condition as the similarity condition until a halting condition is satisfied to identify at least one other cluster of documents, wherein those documents of the set of documents previously associated with a cluster of documents are not included among the available documents.
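
Several of the dependent claims add optional steps around the basic loop: ranking the documents of a cluster (claims 7 and 18), providing an identifier that describes a cluster's content (claims 8 and 19), and maintaining a set of candidate documents from which probes are formed (claims 10, 11, 21 and 22). The Python helpers below are one hedged way such steps might look. The claims do not specify a scoring function, a labeling scheme, or a data structure, so the probe-overlap score, the document-frequency label, and the names `rank_cluster`, `label_cluster` and `update_candidates` are assumptions made for illustration.

```python
from collections import Counter


def rank_cluster(probe, cluster_docs):
    """Rank a cluster's documents by how many probe features each contains (assumed scoring)."""
    def score(text):
        return len(probe & set(text.lower().split()))
    return sorted(cluster_docs, key=score, reverse=True)


def label_cluster(cluster_docs, n_terms=2):
    """Describe a cluster by the terms that occur in the most documents (assumed labeling scheme)."""
    counts = Counter()
    for text in cluster_docs:
        # Count each term once per document; sorting keeps tie order deterministic.
        counts.update(sorted(set(text.lower().split())))
    return [term for term, _ in counts.most_common(n_terms)]


def update_candidates(candidates, clustered):
    """Remove documents already associated with a cluster from the probe-candidate set."""
    return [doc_id for doc_id in candidates if doc_id not in clustered]
```

For example, `update_candidates([0, 1, 2, 3], {0, 1})` leaves `[2, 3]` as the documents from which the next probe may be formed.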

Current Assignee: Justsystems Evans Research Incorporated
Original Assignee: Clairvoyance Corporation
Inventors: Bennett, Jeffrey; Evans, David; Sheftel, Victor
Application Number: US11/272,785
Publication Number: US 20070112898A1
US Class Current: 1/1
CPC Class Codes: G06F 16/35 (Clustering; Classification)
