System and method for efficiently generating cluster groupings in a multi-dimensional concept space

US 6,778,995 B1
Filed: 08/31/2001
Issued: 08/17/2004
Est. Priority Date: 08/31/2001
Status: Active Grant


12 Assignments

0 Petitions

Abstract

A system and method for efficiently generating cluster groupings in a multi-dimensional concept space are described. A plurality of terms is extracted from each document in a collection of stored unstructured documents. A concept space is built over the document collection. Terms substantially correlated between a plurality of documents within the document collection are identified. Each correlated term is expressed as a vector mapped along an angle θ originating from a common axis in the concept space. A difference between the angle θ for each document and an angle σ for each cluster within the concept space is determined. Each such cluster is populated with those documents for which the difference between the angle θ for the document and the angle σ for the cluster falls within a predetermined variance. A new cluster is created within the concept space for those documents for which that difference falls outside the predetermined variance.
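The assignment rule in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation. The function names, the dictionary layout of a cluster, and the choice to fix each cluster's σ at the θ of its first document are all assumptions made for illustration.

```python
import math


def angle_from_axis(vector, axis):
    """Return the angle theta (radians) between a vector and the common axis."""
    dot = sum(v * a for v, a in zip(vector, axis))
    norm_v = math.sqrt(sum(v * v for v in vector))
    norm_a = math.sqrt(sum(a * a for a in axis))
    if norm_v == 0 or norm_a == 0:
        raise ValueError("a zero-length vector has no defined angle")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_v * norm_a)))
    return math.acos(cosine)


def cluster_by_angle(doc_vectors, axis, variance):
    """Group documents whose theta lies within `variance` of a cluster's sigma.

    Assumption: sigma is fixed at the theta of the cluster's first document.
    """
    clusters = []  # each cluster: {"sigma": float, "members": [doc_id, ...]}
    for doc_id, vector in doc_vectors.items():
        theta = angle_from_axis(vector, axis)
        best = None
        for cluster in clusters:
            diff = abs(theta - cluster["sigma"])
            if diff <= variance and (best is None or diff < best[0]):
                best = (diff, cluster)
        if best is None:
            # Theta falls outside the variance of every cluster: start a new one.
            clusters.append({"sigma": theta, "members": [doc_id]})
        else:
            best[1]["members"].append(doc_id)
    return clusters


docs = {"a": [1.0, 0.0], "b": [1.0, 0.05], "c": [0.0, 1.0]}
result = cluster_by_angle(docs, axis=[1.0, 0.0], variance=0.2)
```

With these sample vectors, documents `a` and `b` lie close to the axis and share a cluster, while `c` is perpendicular to it and starts a second cluster.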

32 Claims

1. A system for building a multi-dimensional semantic concept space over a stored document collection, comprising:
- an extraction module identifying a plurality of documents within a stored document collection containing substantially correlated terms reflecting syntactic content, comprising;
  
  an extractor extracting the terms in literal form from the documents;
  
  a selector selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
  
  a vector module generating a vector reflecting latent semantic similarities discovered between substantially correlated documents logically projected at an angle θ from a common axis in a concept space;
  
  a cluster module forming one or more arbitrary clusters at an angle σ from the common axis in the concept space, each cluster comprising documents having such an angle θ falling within a predefined variance of the angle σ for the cluster, and constructing a new arbitrary cluster at an angle σ from the common axis in the concept space, each new cluster comprising documents having such an angle θ falling outside the predefined variance of the angle σ for the remaining clusters.
2. A system according to claim 1, further comprising:
3. A system according to claim 1, further comprising:
- a finalization module finalizing the clusters, comprising at least one of merging a plurality of clusters into a single cluster, splitting a cluster into a plurality of clusters, and removing at least one of a minimal or outlier cluster.
4. A system according to claim 1, further comprising:
- a generation module generating the clusters through k-means clustering.
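The extractor and selector elements of claim 1 can be sketched as follows. This is an illustrative reading only. The regular-expression tokenizer, the use of per-document counts as the "frequency of occurrence", and the `min_docs`/`max_docs` bounds standing in for the predefined threshold are assumptions, not details taken from the claims.

```python
import re
from collections import Counter


def extract_terms(text):
    """Extract terms in literal form (no stemming or case folding)."""
    return re.findall(r"\w+", text)


def select_terms(documents, min_docs, max_docs):
    """Select terms whose document frequency falls within [min_docs, max_docs].

    Assumption: the predefined threshold is modeled as an inclusive range
    on the number of documents in which a term occurs.
    """
    doc_freq = Counter()
    for text in documents.values():
        # Count each term once per document.
        doc_freq.update(set(extract_terms(text)))
    return {term for term, n in doc_freq.items() if min_docs <= n <= max_docs}


documents = {
    "d1": "cluster angle concept",
    "d2": "cluster angle space",
    "d3": "cluster theme",
}
selected = select_terms(documents, min_docs=2, max_docs=2)
```

In this sample, `cluster` occurs in every document and the terms that occur only once are too rare, so only `angle` falls within the range.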

5. A method for building a multi-dimensional semantic concept space over a stored document collection, comprising:
- identifying a plurality of documents within a stored document collection containing substantially correlated terms reflecting syntactic content, comprising;
  
  extracting the terms in literal form from the documents;
  
  selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
  
  generating a vector reflecting latent semantic similarities discovered between substantially correlated documents logically projected at an angle θ from a common axis in a concept space;
  
  forming one or more arbitrary clusters at an angle σ from the common axis in the concept space, each cluster comprising documents having such an angle θ falling within a predefined variance of the angle σ for the cluster; and
  
  constructing a new arbitrary cluster at an angle σ from the common axis in the concept space, each new cluster comprising documents having such an angle θ falling outside the predefined variance of the angle σ for the remaining clusters.
6. A method according to claim 5, further comprising:
7. A method according to claim 5, further comprising:
- finalizing the clusters, comprising at least one of merging a plurality of clusters into a single cluster, splitting a cluster into a plurality of clusters, and removing at least one of a minimal or outlier cluster.
8. A method according to claim 5, further comprising:
- generating the clusters through k-means clustering.
9. A computer-readable storage medium holding code for performing the method according to claims 5, 6, 7, or 8.

10. A system for efficiently generating cluster groupings in a multi-dimensional concept space, comprising:
- an extraction module extracting a plurality of terms from each document in a collection of stored unstructured documents, comprising;
  
  an extractor extracting the terms in literal form from the documents;
  
  a selector selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated; and
  
  a cluster module building a concept space over the document collection, comprising;
  
  an identifier submodule identifying terms substantially correlated between a plurality of documents within the document collection;
  
  a mapping submodule expressing each correlated term as a vector mapped along an angle θ originating from a common axis in the concept space;
  
  a difference submodule determining a difference between the angle θ for each document and an angle σ for each cluster within the concept space;
  
  a build submodule populating an arbitrary cluster with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance and creating a new arbitrary cluster within the concept space those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
11. A system according to claim 10, further comprising:
12. A system according to claim 10, further comprising:
- a formation module forming a plurality of terms into at least one phrase.
13. A system according to claim 10, further comprising:
- a formation module forming a plurality of concepts into at least one theme.
14. A system according to claim 10, further comprising:
- a calculation module calculating a cosine representing a difference between the angle θ and the common axis.
15. A system according to claim 10, further comprising:
- a normalize submodule normalizing each vector.
16. A system according to claim 10, further comprising:
- a histogram module determining a histogram of concepts in each unstructured document, each concept representing a term occurring in one or more of the unstructured documents.
17. A system according to claim 10, further comprising:
- a corpus module determining a frequency of occurrences of concepts in the collection of unstructured documents, each concept representing a term occurring in one or more of the unstructured documents.
18. A system according to claim 10, further comprising:
- a merger module merging a plurality of clusters into a single cluster.
19. A system according to claim 10, further comprising:
- a splitter module splitting a cluster into a plurality of clusters.
20. A system according to claim 10, further comprising:
- a filter module removing at least one of a minimal or outlier cluster.
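Claims 14 through 17 name a cosine calculation, vector normalization, a per-document concept histogram, and a corpus-wide frequency count. The sketch below shows one plain reading of each step. The function names are hypothetical, and treating a concept as a bare term string is an assumption made to keep the example small.

```python
import math
from collections import Counter


def concept_histogram(terms):
    """Count how often each concept (here, a term string) occurs in one document."""
    return Counter(terms)


def corpus_frequency(histograms):
    """Sum per-document histograms into a frequency count for the collection."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)
    return [x / length for x in vector]


def cosine_to_axis(vector, axis):
    """Cosine of the angle theta between a vector and the common axis."""
    unit_v = normalize(vector)
    unit_a = normalize(axis)
    return sum(x * y for x, y in zip(unit_v, unit_a))


h1 = concept_histogram(["angle", "cluster", "angle"])
h2 = concept_histogram(["cluster", "theme"])
totals = corpus_frequency([h1, h2])
cos_theta = cosine_to_axis([1.0, 1.0], [1.0, 0.0])
```

Normalizing first means the cosine reduces to a dot product of unit vectors, so document length does not affect the angle.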

21. A method for efficiently generating cluster groupings in a multi-dimensional concept space, comprising:
- extracting a plurality of terms from each document in a collection of stored unstructured documents; and
  
  building a concept space over the document collection, comprising;
  
  identifying terms substantially correlated between a plurality of documents within the document collection, comprising;
  
  extracting the terms in literal form from the documents;
  
  selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
  
  expressing each correlated term as a vector mapped along an angle θ originating from a common axis in the concept space;
  
  determining a difference between the angle θ for each document and an angle σ for each cluster within the concept space;
  
  populating an arbitrary cluster with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance; and
  
  creating a new arbitrary cluster within the concept space those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
22. A method according to claim 21, further comprising:
23. A method according to claim 21, further comprising:
- forming a plurality of terms into at least one phrase.
24. A method according to claim 21, further comprising:
- forming a plurality of concepts into at least one theme.
25. A method according to claim 21, further comprising:
- calculating a cosine representing a difference between the angle θ and the common axis.
26. A method according to claim 21, further comprising:
- normalizing each vector.
27. A method according to claim 21, further comprising:
- determining a histogram of concepts in each unstructured document, each concept representing a term occurring in one or more of the unstructured documents.
28. A method according to claim 21, further comprising:
- determining a frequency of occurrences of concepts in the collection of unstructured documents, each concept representing a term occurring in one or more of the unstructured documents.
29. A method according to claim 21, further comprising:
- merging a plurality of clusters into a single cluster.
30. A method according to claim 21, further comprising:
- splitting a cluster into a plurality of clusters.
31. A method according to claim 21, further comprising:
- removing at least one of a minimal or outlier cluster.
32. A computer-readable storage medium holding code for performing the method according to claims 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31.
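The merging and removal steps of claims 29 and 31 could be sketched as below. The claims do not say when clusters should be merged or what counts as a minimal cluster, so the `tolerance` on σ, the size-weighted update of σ, and the `min_size` cutoff are assumptions introduced for this example, as is the dictionary layout of a cluster.

```python
def merge_close_clusters(clusters, tolerance):
    """Merge clusters whose sigma values lie within `tolerance` of each other.

    Assumption: clusters are compared in order of sigma, and a merged
    cluster takes the size-weighted mean of the two sigma values.
    """
    merged = []
    for cluster in sorted(clusters, key=lambda c: c["sigma"]):
        if merged and abs(cluster["sigma"] - merged[-1]["sigma"]) <= tolerance:
            last = merged[-1]
            n_last = len(last["members"])
            n_new = len(cluster["members"])
            if n_last + n_new > 0:
                last["sigma"] = (
                    last["sigma"] * n_last + cluster["sigma"] * n_new
                ) / (n_last + n_new)
            last["members"] = last["members"] + cluster["members"]
        else:
            # Copy so the caller's clusters are not modified.
            merged.append(
                {"sigma": cluster["sigma"], "members": list(cluster["members"])}
            )
    return merged


def remove_minimal_clusters(clusters, min_size):
    """Drop clusters holding fewer than `min_size` documents."""
    return [c for c in clusters if len(c["members"]) >= min_size]


clusters = [
    {"sigma": 0.10, "members": ["a", "b"]},
    {"sigma": 0.12, "members": ["c"]},
    {"sigma": 1.00, "members": ["d"]},
]
merged = merge_close_clusters(clusters, tolerance=0.05)
kept = remove_minimal_clusters(merged, min_size=2)
```

Here the first two clusters merge because their σ values differ by 0.02, and the single-document cluster at σ = 1.00 is then removed as minimal.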

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: Attenex Corp. (FTI Consulting Incorporated)
Inventors: Gallivan, Dan
Primary Examiner(s): AMSBURY, WAYNE P
Application Number: US09/943,918
Time in Patent Office: 1,082 Days
Field of Search: 707/102
US Class Current: 707/739
CPC Class Codes:
G06F 16/283   Multi-dimensional databases...
G06F 16/287   Visualization; Browsing
G06F 16/355   Class or cluster creation o...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9535   Search customisation based ...
Y10S 707/99943   Generating database or data...
