System and method for clustering unstructured documents

US 7,809,727 B2
Filed: 12/24/2007
Issued: 10/05/2010
Est. Priority Date: 08/31/2001
Status: Active Grant

Abstract

A system and method for clustering unstructured documents is provided. Documents having terms with frequencies of occurrence that satisfy upper and lower edge conditions are selected. Concepts are generated for the selected documents. The selected documents are grouped into clusters of the documents. A weight for each of the clusters is evaluated. A similarity value is determined from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights for each selected document. Each selected document is assigned into one such cluster based on the similarity value of the selected document.
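
As a rough illustration of the flow described in the abstract, the following Python sketch keeps terms whose document frequency lies between a lower and an upper edge, builds a term-frequency vector for each document that contains such terms, and assigns each document to the cluster with the largest normalized inner product. Everything specific here is an assumption made for illustration and is not taken from the patent: the function names, the 10% and 90% edge values, treating each retained term as its own concept, using the summed member vectors as the cluster weight, and opening a new cluster when no similarity reaches `min_sim`.

```python
import math
from collections import Counter

def select_terms(docs, lower=0.1, upper=0.9):
    """Keep terms whose share of documents lies strictly between the edges."""
    n = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d.split()))
    return {t for t, c in doc_freq.items() if lower < c / n < upper}

def vectorize(doc, terms):
    """Term-frequency vector restricted to the retained terms."""
    return Counter(t for t in doc.split() if t in terms)

def cosine(a, b):
    """Normalized inner product of two sparse vectors (0.0 if either is empty)."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, lower=0.1, upper=0.9, min_sim=0.5):
    """Return clusters as lists of document indices (hypothetical seeding rule)."""
    terms = select_terms(docs, lower, upper)
    vectors = [(i, vectorize(d, terms)) for i, d in enumerate(docs)]
    vectors = [(i, v) for i, v in vectors if v]  # documents with retained terms
    weights, members = [], []  # assumed cluster weight: sum of member vectors
    for i, v in vectors:
        sims = [cosine(v, w) for w in weights]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < min_sim:
            weights.append(Counter(v))   # seed a new cluster
            members.append([i])
        else:
            weights[best].update(v)      # fold the document into the cluster
            members[best].append(i)
    return members

docs = ["cat dog the", "cat dog the", "stock bond the", "stock bond the"]
print(cluster(docs))
```

In this toy input, the term `the` appears in every document and so fails the upper edge, while the remaining terms separate the documents into two groups.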

53 Citations

16 Claims

1. A system for clustering unstructured documents, comprising:
- a selection module that selects documents having terms with frequencies of occurrence of the terms that satisfy upper edge conditions less than 100% and lower edge conditions greater than 0% from a set of documents;
  
  a concept module that generates concepts based on one or more of the terms for the selected documents; and
  
  a cluster module that groups the selected documents into clusters, comprising:
  
  an evaluation module that evaluates a weight for each of the clusters;
  
  a determination module that determines, for each of the selected documents, inner products of that selected document and each cluster from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights; and
  
  an assignment module that assigns each selected document into one such cluster based on the inner products of the selected document; and
  
  a processor to execute each of the modules, which are stored on a computer-readable storage medium.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. A system according to claim 1, wherein each cluster corresponds to terms from one or more concepts.
  - 3. A system according to claim 1, further comprising:
    - a threshold module that determines the upper and lower edge conditions based on types of the selected documents.
  - 4. A system according to claim 1, further comprising:
    - a change module that changes one such cluster, wherein the change comprises an addition or deletion of a document; and
      
      an update module that updates the cluster to determine a best fit for the selected documents.
  - 5. A system according to claim 1, wherein each inner product is calculated as a distance for each selected document.
  - 6. A system according to claim 5, wherein the distance is calculated according to the equation comprising:
  - 7. A system according to claim 1, further comprising:
    - a threshold module that determines the upper and lower edge conditions, comprising:
      
      a median value module that selects a median value by mapping the terms based on the frequencies of occurrence; and
      
      a calculation module that establishes the upper and lower edge conditions as functions of the median value.
  - 8. A system according to claim 1, further comprising:
    - a cluster creation module that creates the clusters.

9. A computer-implemented method for clustering unstructured documents, comprising the steps of:
- selecting documents having terms with frequencies of occurrence of the terms that satisfy upper edge conditions less than 100% and lower edge conditions greater than 0% from a set of documents;
  
  generating concepts based on one or more of the terms for the selected documents; and
  
  grouping the selected documents into clusters, comprising:
  
  evaluating a weight for each of the clusters;
  
  determining, for each of the selected documents, inner products of that selected document and each cluster from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights; and
  
  assigning each selected document into one such cluster based on the inner products of the selected document, wherein all the steps are performed on a suitably programmed computer.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. A method according to claim 9, wherein each cluster corresponds to terms from one or more concepts.
  - 11. A method according to claim 9, further comprising:
    - determining the upper and lower edge conditions based on types of the selected documents.
  - 12. A method according to claim 9, further comprising:
    - changing one such cluster, wherein the change comprises an addition or deletion of a document; and
      
      updating the cluster to determine a best fit for the selected documents.
  - 13. A method according to claim 9, wherein each inner product is calculated as a distance for each selected document.
  - 14. A method according to claim 13, wherein the distance is calculated according to the equation comprising:
  - 15. A method according to claim 9, further comprising:
    - creating the clusters.
  - 16. A method according to claim 9, further comprising:
    - determining the upper and lower edge conditions, comprising:
      
      selecting a median value by mapping the terms based on the frequencies of occurrence; and
      
      establishing the upper and lower edge conditions as functions of the median value.
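
Claims 7 and 16 derive the upper and lower edge conditions from a median value. The Python sketch below shows one way this could look, under assumptions that are not in the claims: the median is taken over per-term document-frequency fractions, the edges are simple multiples of that median (`lower_factor` and `upper_factor` are hypothetical parameters), and both edges are clamped so they stay strictly between 0% and 100%, as the independent claims require.

```python
from collections import Counter
from statistics import median

def edge_conditions(docs, lower_factor=0.5, upper_factor=2.0):
    """Return (median, lower_edge, upper_edge) for a non-empty document set."""
    n = len(docs)
    # Fraction of documents in which each term occurs at least once.
    doc_freq = Counter(t for d in docs for t in set(d.split()))
    fractions = [c / n for c in doc_freq.values()]
    m = median(fractions)
    eps = 1e-9  # keeps the edges strictly inside the open interval (0, 1)
    lower = max(m * lower_factor, eps)
    upper = min(m * upper_factor, 1 - eps)
    return m, lower, upper

docs = ["cat dog the", "cat dog the", "stock bond the", "stock bond the"]
m, lower, upper = edge_conditions(docs)
print(m, lower, upper)
```

The multiplicative form is only one possible "function of the median value"; additive offsets or percentile bands would fit the claim language equally well.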

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; Gallivan, Dan
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Spieler, William
Application Number: US11/964,000
Publication Number: US 20080104063A1
Time in Patent Office: 1,016 Days
Field of Search: 707/738, 707/750, 707/777
US Class Current: 707/738
CPC Class Codes:
- G06F 16/23: Updating
- G06F 16/24575: using context
- G06F 16/285: Clustering or classification
- G06F 16/313: Selection or weighting of t...
- G06F 16/35: Clustering; Classification
- G06F 16/355: Class or cluster creation o...
- G06F 16/93: Document management systems
- G06F 16/955: using information identifie...
- G06F 3/0641: De-duplication techniques
- Y10S 707/99932: Access augmentation or opti...
- Y10S 707/99936: Pattern matching access
- Y10S 707/99943: Generating database or data...
- Y10S 707/99945: Object-oriented database st...
