System and method for efficiently generating cluster groupings in a multi-dimensional concept space

US 8,402,026 B2
Filed: 08/03/2004
Issued: 03/19/2013
Est. Priority Date: 08/31/2001
Status: Active Grant

Abstract

A system and method for efficiently generating cluster groupings in a multi-dimensional concept space is described. A plurality of terms is extracted from each document in a collection of stored unstructured documents. A concept space is built over the document collection. Terms substantially correlated between a plurality of documents within the document collection are identified. Each correlated term is expressed as a vector mapped along an angle θ originating from a common axis in the concept space. A difference between the angle θ for each document and an angle σ for each cluster within the concept space is determined. Each such cluster is populated with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance. A new cluster is created within the concept space for those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
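
The angle-based assignment described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation. The function names, the use of the first coordinate axis as the common axis, the choice of a cluster's angle σ as the mean of its members' angles, and the expression of the variance as an absolute tolerance in radians are all assumptions.

```python
import math

def angle_from_axis(vec):
    """Angle theta (radians) between vec and the first coordinate axis.
    Treating the first axis as the 'common axis' is an assumption."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("a zero vector has no angle")
    cosine = max(-1.0, min(1.0, vec[0] / norm))  # clamp against rounding error
    return math.acos(cosine)

def cluster_by_angle(doc_vectors, variance):
    """Greedy single pass over {doc_id: vector}.
    A document joins the first cluster whose angle sigma differs from the
    document's theta by at most `variance`; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"sigma": float, "members": [...], "thetas": [...]}
    for doc_id, vec in doc_vectors.items():
        theta = angle_from_axis(vec)
        for cluster in clusters:
            if abs(theta - cluster["sigma"]) <= variance:
                cluster["members"].append(doc_id)
                cluster["thetas"].append(theta)
                # Assumed update rule: sigma is the mean angle of the members.
                cluster["sigma"] = sum(cluster["thetas"]) / len(cluster["thetas"])
                break
        else:
            clusters.append({"sigma": theta, "members": [doc_id], "thetas": [theta]})
    return clusters
```

For example, with vectors `(1, 0)`, `(1, 0.05)` and `(0, 1)` and a tolerance of 0.1 radians, the first two documents share a cluster and the third starts its own.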

25 Claims

1. A system for creating stored cluster representations of document semantics, comprising:
- a text analyzer configured to order concepts contained in documents selected from a document store by overall frequencies of occurrence to form a corpus of the documents;
  
  a document selection module configured to select a subset of the documents in the corpus that contain those concepts having frequencies of occurrence that occur within a bounded range of concept frequencies, comprising:
  
  a median determination submodule configured to set a median for the bounded range by document type;
  
  a bounded range determination submodule configured to establish upper and lower edge conditions of the bounded range relative to the median; and
  
  a selection submodule configured to select the documents that occur within the upper and lower edge conditions;
  
  a cluster module configured to assign the documents in the subset into clusters, comprising:
  
  an initial cluster submodule configured to group those documents from the subset that contain matching concepts into an arbitrary cluster for each of the matching concepts;
  
  a distance determination submodule configured to determine Euclidean distances between each of the arbitrary clusters and each remaining document that is not yet grouped into a cluster and to apply a variance of five percent to the Euclidean distances; and
  
  a secondary cluster submodule configured to place each remaining document into the arbitrary cluster for which the Euclidean distance between that remaining document and that arbitrary cluster falls within the variance;
  
  a cluster formation module configured to form a new arbitrary cluster for each remaining document that was not previously placed in one of the arbitrary clusters and that is associated with Euclidean distances that all fall outside the variance;
  
  a database configured to finalize and store the arbitrary clusters; and
  
  a processor configured to execute the modules.
- - 2. A system according to claim 1, wherein the distance determination submodule is further configured to redetermine the Euclidean distances between each remaining document and at least one such cluster following the placing of one or more of the other remaining documents into the at least one such cluster at least once until the at least one such cluster settles.
  - 3. A system according to claim 1, wherein the Euclidean distances are determined based on independent variables selected from the group comprising concepts, frequencies of occurrence, and documents.
  - 4. A system according to claim 1, further comprising:
    - a concept removal module configured to remove concepts falling outside the upper and lower edge conditions, which are set in a range substantially comprising from one percent to fifteen percent.
  - 5. A system according to claim 1, wherein the documents comprise email.
  - 6. A system according to claim 1, further comprising:
    - an arrangement module to arrange a center of each cluster around a common origin; and
      
      a display to present the arranged clusters.
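
The document selection step recited in claims 1 and 4 (ordering concepts by overall frequency, fixing a median, and keeping only concepts inside upper and lower edge conditions) might be sketched as below. This is a hedged illustration only. The name `select_documents`, the representation of each document as a list of concept strings, and the reading of the edge conditions as fractional offsets from the median are assumptions. The claims say the median is set by document type, so the sketch accepts it as a hypothetical `center` parameter and falls back to the statistical median of the concept counts.

```python
from collections import Counter
from statistics import median

def select_documents(docs, lower=0.15, upper=0.15, center=None):
    """docs maps doc_id -> list of concepts.
    Returns (kept_concepts, subset), where subset maps doc_id -> kept concepts
    for every document that still contains at least one kept concept."""
    # Overall frequency of occurrence of each concept across the corpus.
    freq = Counter(c for concepts in docs.values() for c in concepts)
    ordered = freq.most_common()  # concepts ordered by overall frequency
    if center is None:
        center = median(freq.values())  # stand-in for a per-document-type median
    low_edge = center * (1 - lower)     # assumed form of the lower edge condition
    high_edge = center * (1 + upper)    # assumed form of the upper edge condition
    kept = {c for c, n in ordered if low_edge <= n <= high_edge}
    subset = {}
    for doc_id, concepts in docs.items():
        remaining = [c for c in concepts if c in kept]
        if remaining:
            subset[doc_id] = remaining
    return kept, subset
```

The default offsets of 0.15 echo the upper end of the one to fifteen percent range in claim 4, but how that range maps onto concrete thresholds is not stated in the claims and is assumed here.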

7. A method for creating stored cluster representations of document semantics, comprising:
- ordering concepts contained in documents selected from a document store by overall frequencies of occurrence to form a corpus of the documents;
  
  selecting a subset of the documents in the corpus that contain those concepts having frequencies of occurrence that occur within a bounded range of concept frequencies, comprising:
  
  setting a median for the bounded range by document type;
  
  establishing upper and lower edge conditions of the bounded range relative to the median; and
  
  selecting the documents that occur within the upper and lower edge conditions;
  
  assigning the documents in the subset into clusters, comprising:
  
  grouping those documents from the subset that contain matching concepts into an arbitrary cluster for each of the matching concepts;
  
  determining Euclidean distances between each of the arbitrary clusters and each remaining document that is not yet grouped into a cluster;
  
  applying a variance of five percent to the Euclidean distances; and
  
  placing each remaining document into the arbitrary cluster for which the Euclidean distance between that remaining document and that cluster falls within the variance;
  
  forming a new arbitrary cluster for each remaining document that was not previously placed in one of the arbitrary clusters and that is associated with Euclidean distances that all fall outside the variance; and
  
  finalizing and storing the arbitrary clusters.
- - 8. A method according to claim 7, further comprising:
    - redetermining the Euclidean distances between each remaining document and at least one such cluster following the placing of one or more of the other remaining documents into the at least one such cluster at least once until the at least one such cluster settles.
  - 9. A method according to claim 7, wherein finalizing the clusters comprises one or more of:
    - merging a plurality of the clusters into a single cluster;
      
      splitting a single cluster into a plurality of clusters; and
      
      deleting outlier clusters.
  - 10. A method according to claim 7, further comprising:
    - determining the Euclidean distances based on independent variables selected from the group comprising concepts, frequencies of occurrence, and documents.
  - 11. A computer-readable storage medium holding code for performing the method according to claim 7.
  - 12. A method according to claim 7, further comprising:
    - removing concepts falling outside the upper and lower edge conditions, which are set in a range substantially comprising from one percent to fifteen percent.
  - 13. A method according to claim 7, wherein the documents comprise email.
  - 14. A method according to claim 7, further comprising:
    - arranging a center of each cluster around a common origin; and
      
      displaying the arranged clusters.
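
The placement loop of claims 7 and 8 (measure Euclidean distances from each remaining document to the existing clusters, place documents that fall within the five percent variance, redetermine distances until the clusters settle, then open new clusters for whatever is left) could look something like the following. It is a sketch under stated assumptions, not the claimed method itself: documents are assumed to be numeric tuples, a cluster is represented by the centroid of its members, and "within the variance" is read as "no farther from the centroid than five percent of the centroid's own distance from the origin". The names `centroid` and `place_remaining` are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def place_remaining(seed_clusters, remaining, variance=0.05):
    """seed_clusters: list of lists of vectors; remaining: list of vectors.
    Returns a new list of clusters with every remaining document assigned."""
    clusters = [list(c) for c in seed_clusters]
    pending = list(remaining)
    placed_any = True
    # Repeat passes until a full pass places nothing (the clusters have settled).
    while placed_any and pending:
        placed_any = False
        still_pending = []
        for doc in pending:
            # Centroids are redetermined before each decision, so earlier
            # placements influence later ones.
            centers = [centroid(c) for c in clusters]
            best = min(range(len(centers)),
                       key=lambda i: math.dist(doc, centers[i]),
                       default=None)
            if best is not None and (
                math.dist(doc, centers[best])
                <= variance * math.hypot(*centers[best])  # assumed threshold
            ):
                clusters[best].append(doc)
                placed_any = True
            else:
                still_pending.append(doc)
        pending = still_pending
    # Every document still outside the variance of all clusters gets its own cluster.
    clusters.extend([doc] for doc in pending)
    return clusters
```

`math.dist` and the multi-argument form of `math.hypot` require Python 3.8 or later.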

15. A system for displaying stored cluster representations of document semantics, comprising:
- a document store configured to store documents containing one or more concepts;
  
  a text analyzer configured to order concepts parsed from the documents by overall frequencies of occurrence in the documents and further configured to identify those concepts having frequencies of occurrence that occur within upper and lower thresholds for concept frequencies;
  
  an initial cluster module configured to choose sets of the identified concepts that match and further configured to create an arbitrary cluster for those of the documents corresponding to each set of concepts that match;
  
  a further cluster module configured to place each remaining document that is not yet in a cluster, comprising:
  
  a distance measuring submodule configured to determine Euclidean distances between each arbitrary cluster and an origin and the Euclidean distance between the remaining document and the origin;
  
  a distance evaluation submodule configured to evaluate the Euclidean distances of the arbitrary clusters against the Euclidean distance of the remaining document; and
  
  a document placer submodule configured to place the remaining document into the arbitrary cluster at minimal variance of five percent from the remaining document and into a new arbitrary cluster when the arbitrary clusters exceed the variance and the remaining document was not previously placed in one of the arbitrary clusters;
  
  a visualization module configured to present the arbitrary clusters projected onto a two-dimensional display space; and
  
  a processor configured to execute the modules.
- - 16. A system according to claim 15, wherein a new Euclidean distance is determined between one such cluster and the origin following the placing of at least one remaining document into the one such cluster.
  - 17. A system according to claim 15, further comprising:
    - a concept extractor configured to extract the concepts from the documents; and
      
      a preprocessor configured to preprocess the concepts, comprising one or more of:
      
      a normalizer configured to normalize the concepts; and
      
      a filter configured to remove concepts not relevant to semantic content.
  - 18. A system according to claim 15, further comprising:
    - a cluster finalizer configured to finalize the clusters, comprising one or more of:
      
      a merger configured to merge a plurality of the clusters into a single cluster;
      
      a splitter configured to split a single cluster into a plurality of clusters; and
      
      a cluster remover configured to delete outlier clusters.
  - 19. A system according to claim 15, wherein the Euclidean distances are based on independent variables selected from the group comprising concepts, frequencies of occurrence, and documents.
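
Claims 15 and 16 compare distances from a common origin rather than document-to-cluster distances. One possible reading is sketched below: compute each cluster's distance from the origin and the remaining document's distance from the origin, pick the cluster with the smallest relative difference, and accept it only if that difference is at most five percent. The function names, the dictionary of cluster centers, and the use of a relative difference measured against the document's distance are assumptions made for illustration.

```python
import math

def distance_from_origin(vec):
    """Euclidean distance between vec and the origin."""
    return math.sqrt(sum(x * x for x in vec))

def place_by_origin_distance(cluster_centers, doc, variance=0.05):
    """cluster_centers maps a cluster name to its center vector.
    Returns the name of the cluster at minimal variance from doc, or None
    when every cluster exceeds the variance (the caller would then create
    a new cluster for the document)."""
    d_doc = distance_from_origin(doc)
    if d_doc == 0:
        return None  # a relative difference is undefined for the origin itself
    best_name, best_gap = None, None
    for name, center in cluster_centers.items():
        gap = abs(distance_from_origin(center) - d_doc) / d_doc
        if best_gap is None or gap < best_gap:
            best_name, best_gap = name, gap
    if best_gap is not None and best_gap <= variance:
        return best_name
    return None
```

After a document is placed, the caller would recompute that cluster's center and its distance from the origin, in line with claim 16. Note that this sketch compares magnitudes only, so two vectors pointing in different directions can still match.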

20. A method for displaying stored cluster representations of document semantics, comprising:
- selecting concepts parsed from documents stored in a document store;
  
  ordering the concepts by overall frequencies of occurrence in the documents and identifying those concepts having frequencies of occurrence that occur within upper and lower thresholds for concept frequencies;
  
  choosing sets of the identified concepts that match and creating an arbitrary cluster for those of the documents corresponding to each set of concepts that match;
  
  placing each remaining document that is not yet in a cluster, comprising:
  
  determining Euclidean distances between each arbitrary cluster and an origin and the Euclidean distance between the remaining document and the origin;
  
  evaluating the Euclidean distances of the arbitrary clusters against the Euclidean distance of the remaining document; and
  
  placing the remaining document into the arbitrary cluster at minimal variance of five percent from the remaining document and into a new arbitrary cluster when the clusters exceed the variance and the remaining document was not previously placed in one of the clusters; and
  
  presenting the arbitrary clusters projected onto a two-dimensional display space.
- - 21. A method according to claim 20, further comprising:
    - determining a new Euclidean distance between one such cluster and the origin following the placing of at least one remaining document into the one such cluster.
  - 22. A method according to claim 20, further comprising:
    - extracting the concepts from the documents; and
      
      preprocessing the concepts, comprising one or more of:
      
      normalizing the concepts; and
      
      removing concepts not relevant to semantic content.
  - 23. A method according to claim 20, further comprising:
    - finalizing the clusters, comprising one or more of:
      
      merging a plurality of the clusters into a single cluster;
      
      splitting a single cluster into a plurality of clusters; and
      
      deleting outlier clusters.
  - 24. A method according to claim 20, wherein the Euclidean distances are based on independent variables selected from the group comprising concepts, frequencies of occurrence, and documents.
  - 25. A computer-readable storage medium holding code for performing the method according to claim 20.
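
The finalizing step named in claims 9, 18 and 23 (merging clusters, splitting clusters, deleting outlier clusters) is not specified in detail in the claims. The sketch below illustrates just two of the three operations under assumed criteria: clusters whose centroids lie within a hypothetical `merge_radius` are merged, and clusters with fewer than a hypothetical `min_size` members are treated as outliers and deleted. Splitting is omitted, and none of these thresholds come from the patent text.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def finalize(clusters, merge_radius, min_size):
    """clusters: list of lists of vectors.
    Merges each cluster into the first already-kept cluster whose centroid is
    within merge_radius, then drops clusters smaller than min_size."""
    merged = []
    for members in clusters:
        for target in merged:
            if math.dist(centroid(target), centroid(members)) <= merge_radius:
                target.extend(members)  # merge into the existing cluster
                break
        else:
            merged.append(list(members))  # copy so the input is not mutated
    # Assumed outlier rule: a cluster with too few members is deleted.
    return [c for c in merged if len(c) >= min_size]
```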

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Gallivan, Dan
Primary Examiner(s): Lewis, Cheryl
Assistant Examiner(s): Hoffler, Raheem
Application Number: US10/911,376
Publication Number: US 20050010555A1
Time in Patent Office: 3,150 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes:
G06F 16/283   Multi-dimensional databases...
G06F 16/287   Visualization; Browsing
G06F 16/355   Class or cluster creation o...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9535   Search customisation based ...
Y10S 707/99943   Generating database or data...
