System And Method For Clustering Unstructured Documents

US 20080104063A1
Filed: 12/24/2007
Published: 05/01/2008
Est. Priority Date: 08/31/2001
Status: Active Grant

Abstract

A system and method for clustering unstructured documents is provided. Documents having terms with frequencies of occurrence that satisfy upper and lower edge conditions are selected. Concepts are generated for the selected documents. The selected documents are grouped into clusters of the documents. A weight for each of the clusters is evaluated. A similarity value is determined from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights for each selected document. Each selected document is assigned into one such cluster based on the similarity value of the selected document.
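The selection step described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, corpus-wide term counts, and caller-supplied `lower` and `upper` bounds standing in for the edge conditions; the function names are hypothetical and none of these choices are taken from the patent text.

```python
from collections import Counter


def tabulate(documents):
    """Count how often each whitespace-delimited term occurs across all documents."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())
    return totals


def select_terms(totals, lower, upper):
    """Keep terms whose total frequency lies within [lower, upper] (assumed edge conditions)."""
    return {term for term, count in totals.items() if lower <= count <= upper}


def select_documents(documents, terms):
    """Keep documents that contain at least one of the selected terms."""
    return [doc for doc in documents if terms & set(doc.lower().split())]


docs = [
    "the contract covers the merger",
    "the merger closed in the spring",
    "the weather was mild",
]
totals = tabulate(docs)
# "the" occurs too often and most other terms too rarely; only "merger" survives.
selected_terms = select_terms(totals, lower=2, upper=3)
selected_docs = select_documents(docs, selected_terms)
```

With these illustrative bounds, very common terms and one-off terms are both filtered out, which is one plausible reading of why the abstract names both an upper and a lower condition.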

24 Claims

1. A system for clustering unstructured documents, comprising:
- a selection module to select documents having terms with frequencies of occurrence that satisfy upper and lower edge conditions;
  
  a concept module to generate concepts for the selected documents; and
  
  a cluster module to group the selected documents into clusters of the documents, comprising:
  
  an evaluation module to evaluate a weight for each of the clusters;
  
  a determination module to determine a similarity value from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights for each selected document; and
  
  an assignment module to assign each selected document into one such cluster based on the similarity value of the selected document.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. A system according to claim 1, wherein each cluster comprises terms from one or more concepts.
  - 3. A system according to claim 1, further comprising:
    - a threshold module to determine the upper and lower edge conditions based on types of the selected documents.
  - 4. A system according to claim 1, further comprising:
    - a change module to change one such cluster, wherein the change comprises an addition or deletion of a document; and
      
      an update module to update the cluster to determine a best fit for the selected documents.
  - 5. A system according to claim 1, wherein the similarity value is calculated as a distance for each selected document.
  - 6. A system according to claim 5, wherein the distance is calculated according to the equation comprising:

7. A method for clustering unstructured documents, comprising:
- selecting documents having terms with frequencies of occurrence that satisfy upper and lower edge conditions;
  
  generating concepts for the selected documents; and
  
  grouping the selected documents into clusters of the documents, comprising:
  
  evaluating a weight for each of the clusters;
  
  determining a similarity value from the frequencies of occurrence for at least one of the terms from the concepts and the cluster weights for each selected document; and
  
  assigning each selected document into one such cluster based on the similarity value of the selected document.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. A method according to claim 7, wherein each cluster comprises terms from one or more concepts.
  - 9. A method according to claim 7, further comprising:
    - determining the upper and lower edge conditions based on types of the selected documents.
  - 10. A method according to claim 7, further comprising:
    - changing one such cluster, wherein the change comprises an addition or deletion of a document; and
      
      updating the cluster to determine a best fit for the selected documents.
  - 11. A method according to claim 7, wherein the similarity value is calculated as a distance for each selected document.
  - 12. A method according to claim 11, wherein the distance is calculated according to the equation comprising:

13. A system for providing thematically-grouped documents, comprising:
- a retrieval manager to extract terms from documents and to tabulate frequencies of occurrence for the terms in the documents;
  
  a text analyzer to generate themes from the terms, comprising:
  
  a selection module to select those terms with the frequencies of occurrence that satisfy upper and lower edge conditions; and
  
  a theme module to group the selected terms into the themes; and
  
  a cluster module to form clusters of the documents based on the themes, each cluster comprising a cluster weight, comprising:
  
  a correlation module to correlate one or more of the themes with each cluster;
  
  a similarity value module to determine a similarity value for each document derived from the frequencies of occurrence of the selected terms in that document and the cluster weight of the frequencies of occurrence of the selected terms for each document grouped in the theme for one such cluster; and
  
  an assignment module to assign each document to one of the clusters based on the similarity value for that document.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. A system according to claim 13, further comprising:
    - a comparison module to compare the similarity value for each such document to a predefined variance; and
      
      a variance module to select those documents with the similarity value that satisfies the predefined variance for inclusion in one such theme.
  - 15. A system according to claim 13, further comprising:
    - a document module to remove duplicate documents from the matrix; and
      
      a reevaluation module to iteratively reevaluate the matrix by determining revised similarity values.
  - 16. A system according to claim 13, wherein the similarity value is calculated as a Euclidean distance for each document.
  - 17. A system according to claim 13, further comprising:
    - a threshold module to determine the upper and lower edge conditions, comprising:
      
      a map module to map the terms based on the frequencies of occurrence;
      
      a median value module to select a median value; and
      
      a calculation module to establish the upper and lower edge conditions as functions of the median value.
  - 18. A system according to claim 13, further comprising:
    - a weight module to update the cluster weights to determine a best fit between each document and one such cluster.
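Claims 16 and 17 name two concrete ingredients: edge conditions derived from a median frequency, and a similarity value computed as a Euclidean distance. The sketch below shows one way these could fit together. The scaling factors `low_factor` and `high_factor`, the nearest-cluster assignment rule, and all function and cluster names are assumptions made for illustration; the claims only state that the conditions are functions of the median and that assignment is based on the similarity value.

```python
import math
from statistics import median


def edge_conditions(frequencies, low_factor=0.5, high_factor=2.0):
    """Derive lower and upper bounds as multiples of the median frequency (assumed form)."""
    mid = median(frequencies.values())
    return low_factor * mid, high_factor * mid


def euclidean(doc_vec, cluster_vec):
    """Euclidean distance between a document's term frequencies and a cluster's weights."""
    return math.sqrt(sum((d - c) ** 2 for d, c in zip(doc_vec, cluster_vec)))


def assign(doc_vec, clusters):
    """Return the name of the cluster whose weight vector is closest (assumed rule)."""
    return min(clusters, key=lambda name: euclidean(doc_vec, clusters[name]))


# Hypothetical term frequencies; the median is 5, so the bounds are 2.5 and 10.0.
freqs = {"merger": 4, "contract": 6, "the": 40, "zygote": 1, "lease": 5}
lower, upper = edge_conditions(freqs)
kept = sorted(term for term, count in freqs.items() if lower <= count <= upper)

# Vectors are ordered like `kept`: (contract, lease, merger).
clusters = {"deals": [5.0, 6.0, 4.0], "property": [0.0, 1.0, 7.0]}
best = assign([4.0, 5.0, 3.0], clusters)
```

Tying the bounds to the median rather than to fixed counts lets the same rule adapt to corpora of different sizes, which may be why the dependent claims frame the thresholds that way.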

19. A method for providing thematically-grouped documents, comprising:
- extracting terms from documents and tabulating frequencies of occurrence for the terms in the documents;
  
  generating themes from the terms, comprising:
  
  selecting those terms with the frequencies of occurrence that satisfy upper and lower edge conditions; and
  
  grouping the selected terms into the themes; and
  
  forming clusters of the documents based on the themes, each cluster comprising a cluster weight, comprising:
  
  correlating one or more of the themes with each cluster;
  
  determining a similarity value for each document derived from the frequencies of occurrence of the selected terms in that document and the cluster weight of the frequencies of occurrence of the selected terms for each document grouped in the theme for one such cluster; and
  
  assigning each document to one of the clusters based on the similarity value for that document.
- Dependent claims (20, 21, 22, 23, 24):
  - 20. A method according to claim 19, further comprising:
    - comparing the similarity value for each such document to a predefined variance; and
      
      selecting those documents with the similarity value that satisfies the predefined variance for inclusion in one such theme.
  - 21. A method according to claim 19, further comprising:
    - removing duplicate documents from the matrix; and
      
      iteratively reevaluating the matrix by determining revised similarity values.
  - 22. A method according to claim 19, wherein the similarity value is calculated as a Euclidean distance for each document.
  - 23. A method according to claim 19, further comprising:
    - determining the upper and lower edge conditions, comprising:
      
      mapping the terms based on the frequencies of occurrence;
      
      selecting a median value; and
      
      establishing the upper and lower edge conditions as functions of the median value.
  - 24. A method according to claim 19, further comprising:
    - updating the cluster weights to determine a best fit between each document and one such cluster.
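Claims 21 and 24 describe removing duplicate documents, iteratively reevaluating similarity values, and updating cluster weights toward a best fit. The claims refer to "the matrix" without defining it on this page, so the sketch below assumes it is a list of per-document term-frequency vectors. The update rule (each cluster weight becomes the mean of its members, as in k-means) is a swapped-in technique chosen for illustration, not something stated in the claims, and all names are hypothetical.

```python
import math


def dedupe(vectors):
    """Drop exact duplicate vectors, keeping the first occurrence of each in order."""
    seen, unique = set(), []
    for vec in vectors:
        key = tuple(vec)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def nearest(vec, centers):
    """Index of the cluster weight vector with the smallest Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(vec, centers[i]))


def recluster(vectors, centers, max_rounds=20):
    """Alternate assignment and weight updates until the labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_rounds):
        new_labels = [nearest(vec, centers) for vec in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the fit no longer improves
        labels = new_labels
        for i in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:  # leave an empty cluster's weights unchanged
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


matrix = [(1, 1), (1, 1), (2, 1), (8, 9), (9, 8)]
unique_docs = dedupe(matrix)
labels, weights = recluster(unique_docs, centers=[(0, 0), (10, 10)])
```

Removing duplicates before iterating keeps repeated documents from pulling a cluster's weights toward themselves, which is one plausible motivation for pairing the two steps in the same claim.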

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; Gallivan, Dan

Granted Patent: US 7,809,727 B2
US Class Current: 707/5
CPC Class Codes

G06F 16/23   Updating

G06F 16/24575   using context

G06F 16/285   Clustering or classification

G06F 16/313   Selection or weighting of t...

G06F 16/35   Clustering; Classification

G06F 16/355   Class or cluster creation o...

G06F 16/93   Document management systems

G06F 16/955   using information identifie...

G06F 3/0641   De-duplication techniques

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99936   Pattern matching access

Y10S 707/99943   Generating database or data...

Y10S 707/99945   Object-oriented database st...
