System and method for efficiently generating cluster groupings in a multi-dimensional concept space
Abstract
A system and method for efficiently generating cluster groupings in a multi-dimensional concept space is described. A plurality of terms are extracted from each document in a collection of stored unstructured documents. A concept space is built over the document collection. Terms substantially correlated between a plurality of documents within the document collection are identified. Each correlated term is expressed as a vector mapped along an angle θ originating from a common axis in the concept space. A difference between the angle θ for each document and an angle σ for each cluster within the concept space is determined. Each such cluster is populated with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance. A new cluster is created within the concept space for those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
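The assignment rule summarized in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the names `angle_from_axis` and `assign_clusters`, the choice of common axis, and the use of a seeding document's angle θ as the new cluster's angle σ are all hypothetical.

```python
import math

def angle_from_axis(vec, axis):
    """Angle theta (radians) between a document vector and the common axis."""
    dot = sum(a * b for a, b in zip(vec, axis))
    norm_v = math.sqrt(sum(a * a for a in vec))
    norm_a = math.sqrt(sum(b * b for b in axis))
    if norm_v == 0 or norm_a == 0:
        raise ValueError("zero-length vectors have no defined angle")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_v * norm_a)))
    return math.acos(cos)

def assign_clusters(thetas, variance):
    """Single pass: join the nearest cluster if |theta - sigma| <= variance,
    otherwise seed a new cluster (assumption: sigma = seeding theta)."""
    sigmas = []    # one angle sigma per cluster
    members = []   # document indices per cluster
    for i, theta in enumerate(thetas):
        best = min(range(len(sigmas)),
                   key=lambda c: abs(theta - sigmas[c]),
                   default=None)
        if best is not None and abs(theta - sigmas[best]) <= variance:
            members[best].append(i)
        else:
            sigmas.append(theta)
            members.append([i])
    return sigmas, members

# Hypothetical two-dimensional document vectors and common axis.
docs = [(1.0, 0.0), (0.98, 0.2), (0.0, 1.0), (0.1, 1.0)]
axis = (1.0, 0.0)
thetas = [angle_from_axis(d, axis) for d in docs]
sigmas, members = assign_clusters(thetas, variance=0.3)
print(members)  # [[0, 1], [2, 3]]
```

Documents 0 and 1 lie close to the axis and share a cluster, while documents 2 and 3 fall outside the variance of that cluster and end up in a second one.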
32 Claims
-
1. A system for building a multi-dimensional semantic concept space over a stored document collection, comprising:
an extraction module identifying a plurality of documents within a stored document collection containing substantially correlated terms reflecting syntactic content, comprising:
an extractor extracting the terms in literal form from the documents;
a selector selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
a vector module generating a vector reflecting latent semantic similarities discovered between substantially correlated documents logically projected at an angle θ from a common axis in a concept space;
a cluster module forming one or more arbitrary clusters at an angle σ from the common axis in the concept space, each cluster comprising documents having such an angle θ falling within a predefined variance of the angle σ for the cluster, and constructing a new arbitrary cluster at an angle σ from the common axis in the concept space, each new cluster comprising documents having such an angle θ falling outside the predefined variance of the angle σ for the remaining clusters.
-
2. A system according to claim 1, further comprising:
a reevaluation module reevaluating the clusters until the angle θ for substantially each document becomes minimized within the predetermined variance of the angle σ for one such cluster.
-
3. A system according to claim 1, further comprising:
a finalization module finalizing the clusters, comprising at least one of merging a plurality of clusters into a single cluster, splitting a cluster into a plurality of clusters, and removing at least one of a minimal or outlier cluster.
-
4. A system according to claim 1, further comprising:
a generation module generating the clusters through k-means clustering.
-
5. A method for building a multi-dimensional semantic concept space over a stored document collection, comprising:
identifying a plurality of documents within a stored document collection containing substantially correlated terms reflecting syntactic content, comprising:
extracting the terms in literal form from the documents;
selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
generating a vector reflecting latent semantic similarities discovered between substantially correlated documents logically projected at an angle θ from a common axis in a concept space;
forming one or more arbitrary clusters at an angle σ from the common axis in the concept space, each cluster comprising documents having such an angle θ falling within a predefined variance of the angle σ for the cluster; and
constructing a new arbitrary cluster at an angle σ from the common axis in the concept space, each new cluster comprising documents having such an angle θ falling outside the predefined variance of the angle σ for the remaining clusters.
-
6. A method according to claim 5, further comprising:
reevaluating the clusters until the angle θ for substantially each document becomes minimized within the predetermined variance of the angle σ for one such cluster.
-
7. A method according to claim 5, further comprising:
finalizing the clusters, comprising at least one of merging a plurality of clusters into a single cluster, splitting a cluster into a plurality of clusters, and removing at least one of a minimal or outlier cluster.
-
8. A method according to claim 5, further comprising:
generating the clusters through k-means clustering.
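The claim names k-means clustering without detailing it. As a rough sketch only, the standard Lloyd iteration could look as follows in Python. The function name `kmeans`, the deterministic seeding with the first k points, the Euclidean distance, and the iteration cap are assumptions made for illustration and do not come from the claims.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd-style k-means on a list of equal-length numeric tuples."""
    # Assumption: seed the centers with the first k points (deterministic).
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            group = [p for p, a in zip(points, assignment) if a == c]
            if group:
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    return centers, assignment

# Hypothetical data: two well-separated groups of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, assignment = kmeans(points, k=2)
print(assignment)  # [0, 0, 1, 1]
```

The loop of reassigning and recentering until nothing changes corresponds loosely to the reevaluation step of claim 6, though the claims express it in terms of the angles θ and σ rather than Euclidean distance.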
-
9. A computer-readable storage medium holding code for performing the method according to claims 5, 6, 7, or 8.
-
10. A system for efficiently generating cluster groupings in a multi-dimensional concept space, comprising:
an extraction module extracting a plurality of terms from each document in a collection of stored unstructured documents, comprising:
an extractor extracting the terms in literal form from the documents;
a selector selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated; and
a cluster module building a concept space over the document collection, comprising:
an identifier submodule identifying terms substantially correlated between a plurality of documents within the document collection;
a mapping submodule expressing each correlated term as a vector mapped along an angle θ originating from a common axis in the concept space;
a difference submodule determining a difference between the angle θ for each document and an angle σ for each cluster within the concept space;
a build submodule populating an arbitrary cluster with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance and creating a new arbitrary cluster within the concept space for those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
-
11. A system according to claim 10, further comprising:
a rebuild module iteratively rebuilding the concept space until the angle θ for substantially each document falls within a minimized distance within the predetermined variance of the angle σ for one such cluster.
-
12. A system according to claim 10, further comprising:
a formation module forming a plurality of terms into at least one phrase.
-
13. A system according to claim 10, further comprising:
a formation module forming a plurality of concepts into at least one theme.
-
14. A system according to claim 10, further comprising:
a calculation module calculating a cosine representing a difference between the angle θ and the common axis.
-
15. A system according to claim 10, further comprising:
a normalize submodule normalizing each vector.
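Claims 14 and 15 refer to a cosine against the common axis and to vector normalization. A minimal Python sketch of both operations follows. The helper names `normalize` and `cosine_to_axis` are hypothetical, and leaving a zero vector unchanged is an assumption made here to avoid division by zero.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0:
        return list(vec)
    return [x / length for x in vec]

def cosine_to_axis(vec, axis):
    """Cosine of the angle theta between a vector and the common axis."""
    u, a = normalize(vec), normalize(axis)
    # For unit vectors the dot product equals the cosine of the angle.
    return sum(x * y for x, y in zip(u, a))

unit = normalize([3.0, 4.0])
cos_theta = cosine_to_axis([3.0, 4.0], [1.0, 0.0])
theta = math.acos(cos_theta)  # angle in radians, if the angle itself is needed
```

Normalizing first means that document length does not influence the result, so only the direction of the vector determines the angle.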
-
16. A system according to claim 10, further comprising:
a histogram module determining a histogram of concepts in each unstructured document, each concept representing a term occurring in one or more of the unstructured documents.
-
17. A system according to claim 10, further comprising:
a corpus module determining a frequency of occurrences of concepts in the collection of unstructured documents, each concept representing a term occurring in one or more of the unstructured documents.
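The per-document histogram of claim 16, the corpus-wide frequency of claim 17, and the frequency-based term selection of claim 10 might be sketched as below. The function names are hypothetical, and reading the "predefined threshold" as an inclusive band between `low` and `high` is an assumption; the claims do not state how the threshold is defined.

```python
from collections import Counter

def concept_histogram(tokens):
    """Histogram of concepts (here simply terms) in one document."""
    return Counter(tokens)

def corpus_frequency(documents):
    """Total occurrences of each concept across the whole collection."""
    total = Counter()
    for tokens in documents:
        total.update(tokens)
    return total

def select_terms(frequencies, low, high):
    """Keep terms whose corpus frequency lies within [low, high] (assumed band)."""
    return sorted(t for t, n in frequencies.items() if low <= n <= high)

# Hypothetical, already tokenized documents.
documents = [
    ["cluster", "angle", "the"],
    ["cluster", "the", "the"],
    ["angle", "cluster", "the"],
]
freq = corpus_frequency(documents)
selected = select_terms(freq, low=2, high=3)
print(selected)  # ['angle', 'cluster']
```

In this toy collection "the" occurs four times and falls outside the band, which shows how a frequency threshold can filter out overly common terms.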
-
18. A system according to claim 10, further comprising:
a merger module merging a plurality of clusters into a single cluster.
-
19. A system according to claim 10, further comprising:
a splitter module splitting a cluster into a plurality of clusters.
-
20. A system according to claim 10, further comprising:
a filter module removing at least one of a minimal or outlier cluster.
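Claims 18 through 20 cover merging, splitting, and removing clusters. The following Python sketch illustrates two of these, merging and removal, on clusters represented by an angle σ and a list of member document indices. The representation, the function names, the `tolerance` and `min_size` parameters, and the size-weighted mean for a merged σ are all assumptions; splitting is omitted for brevity.

```python
def merge_close(sigmas, members, tolerance):
    """Merge clusters whose angles differ by at most `tolerance`.
    Assumes every cluster has at least one member."""
    order = sorted(range(len(sigmas)), key=lambda c: sigmas[c])
    out_sigmas, out_members = [], []
    for c in order:
        if out_sigmas and sigmas[c] - out_sigmas[-1] <= tolerance:
            prev_n, n = len(out_members[-1]), len(members[c])
            # Assumption: the merged sigma is the size-weighted mean angle.
            out_sigmas[-1] = (out_sigmas[-1] * prev_n + sigmas[c] * n) / (prev_n + n)
            out_members[-1] = out_members[-1] + list(members[c])
        else:
            out_sigmas.append(sigmas[c])
            out_members.append(list(members[c]))
    return out_sigmas, out_members

def remove_minimal(sigmas, members, min_size):
    """Drop clusters holding fewer than `min_size` documents."""
    kept = [(s, m) for s, m in zip(sigmas, members) if len(m) >= min_size]
    return [s for s, _ in kept], [m for _, m in kept]

# Hypothetical clusters: two nearly identical angles and one small outlier.
sigmas = [0.10, 0.15, 1.20]
members = [[0, 1], [2], [3]]
merged_sigmas, merged_members = merge_close(sigmas, members, tolerance=0.1)
final_sigmas, final_members = remove_minimal(merged_sigmas, merged_members, min_size=2)
print(final_members)  # [[0, 1, 2]]
```

Running the merge before the removal lets a small cluster survive when it is absorbed by a close neighbor, which is one possible ordering among several.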
-
21. A method for efficiently generating cluster groupings in a multi-dimensional concept space, comprising:
extracting a plurality of terms from each document in a collection of stored unstructured documents; and
building a concept space over the document collection, comprising:
identifying terms substantially correlated between a plurality of documents within the document collection, comprising:
extracting the terms in literal form from the documents;
selecting the terms having frequencies of occurrence falling within a predefined threshold as being substantially correlated;
expressing each correlated term as a vector mapped along an angle θ originating from a common axis in the concept space;
determining a difference between the angle θ for each document and an angle σ for each cluster within the concept space;
populating an arbitrary cluster with those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling within a predetermined variance; and
creating a new arbitrary cluster within the concept space for those documents having such difference between the angle θ for each such document and the angle σ for each such cluster falling outside the predetermined variance.
-
22. A method according to claim 21, further comprising:
iteratively rebuilding the concept space until the angle θ for substantially each document falls within a minimized distance within the predetermined variance of the angle σ for one such cluster.
-
23. A method according to claim 21, further comprising:
forming a plurality of terms into at least one phrase.
-
24. A method according to claim 21, further comprising:
forming a plurality of concepts into at least one theme.
-
25. A method according to claim 21, further comprising:
calculating a cosine representing a difference between the angle θ and the common axis.
-
26. A method according to claim 21, further comprising:
normalizing each vector.
-
27. A method according to claim 21, further comprising:
determining a histogram of concepts in each unstructured document, each concept representing a term occurring in one or more of the unstructured documents.
-
28. A method according to claim 21, further comprising:
determining a frequency of occurrences of concepts in the collection of unstructured documents, each concept representing a term occurring in one or more of the unstructured documents.
-
29. A method according to claim 21, further comprising:
merging a plurality of clusters into a single cluster.
-
30. A method according to claim 21, further comprising:
splitting a cluster into a plurality of clusters.
-
31. A method according to claim 21, further comprising:
removing at least one of a minimal or outlier cluster.
-
32. A computer-readable storage medium holding code for performing the method according to claims 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31.
Specification