Computer-Implemented System and Method For Generating A Reference Set Via Clustering

US 20140108406A1
Filed: 12/16/2013
Published: 04/17/2014
Est. Priority Date: 08/24/2009
Status: Active Grant

Abstract

A computer-implemented system and method for generating a reference set via clustering is provided. A collection of unclassified documents is obtained and grouped into clusters. N-documents are selected from each cluster and are combined as reference set candidates. One of the n-documents from each cluster is located closest to a center of that cluster. A classification code is assigned to each of the reference set candidates. Two or more of the reference set candidates are grouped as a reference set of classified documents.
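
The steps in the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the source does not name a clustering algorithm or a document representation, so plain k-means over numeric document vectors is assumed here. The function names (`cluster_documents`, `select_candidates`, `build_reference_set`) and the `classify` callback, which stands in for whoever or whatever assigns a classification code, are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def cluster_documents(vectors, k, iterations=20, seed=0):
    """Group document indices into at most k clusters (assumed: plain k-means)."""
    rng = random.Random(seed)
    centers = [tuple(vectors[i]) for i in rng.sample(range(len(vectors)), k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: distance(vec, centers[c]))
            clusters[nearest].append(idx)
        # Keep the previous center when a cluster ends up empty.
        centers = [
            centroid([vectors[i] for i in members]) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
    return [members for members in clusters if members]


def select_candidates(vectors, clusters, n):
    """Pick up to n documents per cluster; the first pick is closest to the center."""
    candidates = []
    for members in clusters:
        center = centroid([vectors[i] for i in members])
        ranked = sorted(members, key=lambda i: distance(vectors[i], center))
        candidates.extend(ranked[:n])
    return candidates


def build_reference_set(vectors, k, n, classify):
    """Cluster, select candidates, and attach a classification code to each."""
    clusters = cluster_documents(vectors, k)
    candidates = select_candidates(vectors, clusters, n)
    return {doc: classify(doc) for doc in candidates}


# Hypothetical toy data: six documents forming two well-separated groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
reference_set = build_reference_set(vectors, k=2, n=1, classify=lambda doc: "code-A")
print(reference_set)
```

Taking the n documents nearest the center is one simple way to guarantee that one of them is the closest to the center; the source only requires that property of a single selected document per cluster.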

20 Claims

1. A computer-implemented method for generating a reference set via clustering, comprising:
- obtaining a collection of unclassified documents;
  
  grouping the unclassified documents into clusters;
  
  selecting n-documents from each cluster and combining the selected n-documents as reference set candidates, wherein one of the n-documents from each cluster is located closest to a center of that cluster;
  
  assigning a classification code to each of the reference set candidates; and
  
  grouping two or more of the reference set candidates as a reference set of classified documents.
  - 2. A method according to claim 1, further comprising:
    - building a hierarchical tree of the clusters; and
      
      traversing the hierarchical tree to identify the n-documents.
  - 3. A method according to claim 1, further comprising:
    - applying a size threshold to the reference set candidates; and
      
      selecting the reference set candidates for inclusion in the reference set when the size threshold is satisfied.
  - 4. A method according to claim 1, further comprising:
    - applying a size threshold to the reference set candidates; and
      
      clustering the reference set candidates until the size threshold is satisfied.
  - 5. A method according to claim 4, wherein the reference set candidates are clustered via one of agglomerative and divisive clustering.
  - 6. A method according to claim 1, further comprising at least one of:
    - receiving a number of the n-documents from a user; and
      
      determining the number of the n-documents.
  - 7. A method according to claim 1, further comprising:
    - selecting an additional n-document from each cluster that is furthest from the cluster center.
  - 8. A method according to claim 1, further comprising:
    - refining the reference set candidates, comprising at least one of:
      
      changing clustering input parameters and reclustering the unclassified documents based on the clustering input parameters;
      
      changing the unclassified document collection by filtering out a portion of the unclassified documents; and
      
      selecting different n-documents from each of the clusters.
  - 9. A method according to claim 1, further comprising:
    - identifying features of the unclassified documents;
      
      grouping the features into clusters;
      
      identifying n-features from each cluster as reference set candidate features;
      
      assigning a classification code to each of the reference set candidate features; and
      
      grouping at least a portion of the documents associated with the classified reference set candidate features as a further reference set.
  - 10. A method according to claim 1, further comprising:
    - propagating the classification codes of the selected n-documents to a further set of unclassified documents.
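
Claim 7 above adds, for each cluster, a document that is furthest from the cluster center in addition to the one closest to it. A minimal sketch of that selection is shown here, assuming documents are numeric vectors compared by Euclidean distance; neither assumption comes from the source, and the helper name `closest_and_furthest` is hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def closest_and_furthest(vectors, members):
    """Return the member indices nearest to and furthest from the cluster center."""
    center = centroid([vectors[i] for i in members])

    def to_center(i):
        return distance(vectors[i], center)

    return min(members, key=to_center), max(members, key=to_center)


# Hypothetical cluster of four documents on a line; the center is (3.0, 0.0).
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (9.0, 0.0)]
closest, furthest = closest_and_furthest(vectors, [0, 1, 2, 3])
print(closest, furthest)  # → 2 3
```

Pairing a central document with an outlying one gives the reference set candidates both a typical and an atypical example from each cluster.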

11. A computer-implemented system for generating a reference set via clustering, comprising:
- a collection module to obtain a collection of unclassified documents;
  
  a clustering module to group the unclassified documents into clusters;
  
  a candidate selection module to select n-documents from each cluster and to combine the selected n-documents as reference set candidates, wherein one of the n-documents from each cluster is located closest to a center of that cluster;
  
  a classification module to assign a classification code to each of the reference set candidates; and
  
  a reference set module to group two or more of the reference set candidates as a reference set of classified documents.
  - 12. A system according to claim 11, further comprising:
    - a tree module to build a hierarchical tree of the clusters; and
      
      a traversal module to traverse the hierarchical tree to identify the n-documents.
  - 13. A system according to claim 11, further comprising:
    - a size module to apply a size threshold to the reference set candidates and to select the reference set candidates for inclusion in the reference set when the size threshold is satisfied.
  - 14. A system according to claim 11, further comprising:
    - a size module to apply a size threshold to the reference set candidates and to cluster the reference set candidates until the size threshold is satisfied.
  - 15. A system according to claim 14, wherein the reference set candidates are clustered via one of agglomerative and divisive clustering.
  - 16. A system according to claim 11, further comprising at least one of:
    - an instruction receipt module to receive a number of the n-documents from a user; and
      
      a document determination module to determine the number of the n-documents.
  - 17. A system according to claim 11, further comprising:
    - selecting an additional n-document from each cluster that is furthest from the cluster center.
  - 18. A system according to claim 11, further comprising:
    - a refining module to refine the reference set candidates, comprising at least one of:
      
      a parameter module to change clustering input parameters and to recluster the unclassified documents based on the clustering input parameters;
      
      a filter module to change the unclassified document collection by filtering out a portion of the unclassified documents; and
      
      a document subset module to select different n-documents from each of the clusters.
  - 19. A system according to claim 11, further comprising:
    - a feature identification module to identify features of the unclassified documents;
      
      a feature grouping module to group the features into clusters;
      
      a candidate feature module to identify n-features from each cluster as reference set candidate features;
      
      a feature classification module to assign a classification code to each of the reference set candidate features; and
      
      a feature reference module to group at least a portion of the documents associated with the classified reference set candidate features as a further reference set.
  - 20. A system according to claim 11, further comprising:
    - a propagation module to propagate the classification codes of the selected n-documents to a further set of unclassified documents.
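
Claims 10 and 20 recite propagating classification codes from the selected n-documents to further unclassified documents, without stating how. One possible reading is sketched here as a nearest-neighbor rule: each unclassified document takes the code of the most similar reference document. The nearest-neighbor rule, the vector representation, the function name `propagate_codes`, and the example codes are all assumptions for illustration, not details from the source.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def propagate_codes(reference_set, unclassified):
    """Give each unclassified vector the code of its nearest reference document.

    reference_set: list of (vector, classification_code) pairs.
    unclassified:  list of vectors.
    """
    if not reference_set:
        raise ValueError("reference set must not be empty")
    codes = []
    for vec in unclassified:
        _, code = min(reference_set, key=lambda item: distance(vec, item[0]))
        codes.append(code)
    return codes


# Hypothetical reference set with two classified documents.
reference_set = [((0.0, 0.0), "non-responsive"), ((10.0, 10.0), "responsive")]
print(propagate_codes(reference_set, [(1.0, 2.0), (9.0, 8.0)]))
# → ['non-responsive', 'responsive']
```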

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Incorporated
Inventors: Knight, William C.; McNee, Sean M.

Granted Patent: US 9,336,496 B2
US Class Current: 707/737
CPC Class Codes:
G06F 16/285   Clustering or classification
G06F 16/35   Clustering; Classification
G06F 16/40   of multimedia data, e.g. sl...
G06F 16/93   Document management systems
G06N 5/02   Knowledge representation; S...
