System and method for efficiently generating cluster groupings in a multi-dimensional concept space
Abstract
A system and method for efficiently generating cluster groupings in a multi-dimensional concept space are described. A plurality of terms is extracted from each document in a collection of stored unstructured documents. A concept space is built over the document collection. Terms substantially correlated between a plurality of documents within the document collection are identified. Each correlated term is expressed as a vector mapped along an angle θ originating from a common axis in the concept space. A difference between the angle θ for each document and an angle σ for each cluster within the concept space is determined. Each such cluster is populated with those documents for which the difference between the angle θ for the document and the angle σ for the cluster falls within a predetermined variance. A new cluster is created within the concept space for those documents for which that difference falls outside the predetermined variance.
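The angle comparison in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the choice of reference axis, the reading of the "predetermined variance" as a fraction of σ, and the decision to keep each cluster's σ fixed at the angle of its first document are all assumptions made for illustration. Document vectors are assumed to be non-zero.

```python
import math

def angle_from_axis(vec, axis):
    """Angle in radians between a non-zero vector and the reference axis."""
    dot = sum(v * a for v, a in zip(vec, axis))
    norm = math.sqrt(sum(v * v for v in vec)) * math.sqrt(sum(a * a for a in axis))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def cluster_by_angle(doc_vectors, axis, variance=0.05):
    """Group documents whose angle theta lies within `variance` of a cluster angle sigma.

    Assumption: `variance` is a relative tolerance (fraction of sigma), and
    sigma is the angle of the document that seeded the cluster.
    """
    clusters = []  # each cluster: {"sigma": float, "docs": [document indices]}
    for idx, vec in enumerate(doc_vectors):
        theta = angle_from_axis(vec, axis)
        for cluster in clusters:
            if abs(theta - cluster["sigma"]) <= variance * cluster["sigma"]:
                cluster["docs"].append(idx)  # difference falls within the variance
                break
        else:
            # No existing cluster is close enough: create a new one.
            clusters.append({"sigma": theta, "docs": [idx]})
    return clusters

docs = [(1.0, 1.0), (1.0, 1.02), (1.0, 5.0)]
for cluster in cluster_by_angle(docs, axis=(1.0, 0.0)):
    print(cluster["docs"])
```

In this toy input the first two vectors point in nearly the same direction and share a cluster, while the third falls outside the tolerance and seeds its own.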
25 Claims
1. A system for creating stored cluster representations of document semantics, comprising:
a text analyzer configured to order concepts contained in documents selected from a document store by overall frequencies of occurrence to form a corpus of the documents;
a document selection module configured to select a subset of the documents in the corpus that contain those concepts having frequencies of occurrence that occur within a bounded range of concept frequencies, comprising:
a median determination submodule configured to set a median for the bounded range by document type;
a bounded range determination submodule configured to establish upper and lower edge conditions of the bounded range relative to the median; and
a selection submodule configured to select the documents that occur within the upper and lower edge conditions;
a cluster module configured to assign the documents in the subset into clusters, comprising:
an initial cluster submodule configured to group those documents from the subset that contain matching concepts into an arbitrary cluster for each of the matching concepts;
a distance determination submodule configured to determine Euclidean distances between each of the arbitrary clusters and each remaining document that is not yet grouped into a cluster and to apply a variance of five percent to the Euclidean distances; and
a secondary cluster submodule configured to place each remaining document into the arbitrary cluster for which the Euclidean distance between that remaining document and that arbitrary cluster falls within the variance;
a cluster formation module configured to form a new arbitrary cluster for each remaining document that was not previously placed in one of the arbitrary clusters and that is associated with Euclidean distances that all fall outside the variance;
a database configured to finalize and store the arbitrary clusters; and
a processor configured to execute the modules.
Dependent claims: 2, 3, 4, 5, 6
7. A method for creating stored cluster representations of document semantics, comprising:
ordering concepts contained in documents selected from a document store by overall frequencies of occurrence to form a corpus of the documents;
selecting a subset of the documents in the corpus that contain those concepts having frequencies of occurrence that occur within a bounded range of concept frequencies, comprising:
setting a median for the bounded range by document type;
establishing upper and lower edge conditions of the bounded range relative to the median; and
selecting the documents that occur within the upper and lower edge conditions;
assigning the documents in the subset into clusters, comprising:
grouping those documents from the subset that contain matching concepts into an arbitrary cluster for each of the matching concepts;
determining Euclidean distances between each of the arbitrary clusters and each remaining document that is not yet grouped into a cluster;
applying a variance of five percent to the Euclidean distances; and
placing each remaining document into the arbitrary cluster for which the Euclidean distance between that remaining document and that cluster falls within the variance;
forming a new arbitrary cluster for each remaining document that was not previously placed in one of the arbitrary clusters and that is associated with Euclidean distances that all fall outside the variance; and
finalizing and storing the arbitrary clusters.
Dependent claims: 8, 9, 10, 11, 12, 13, 14
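The document selection step of claim 7 (median, edge conditions, selection) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the claim sets the median by document type, whereas this sketch substitutes the statistical median of the concept counts, and the edge factors `lower_factor` and `upper_factor` are hypothetical values chosen only for the example. The input is assumed to be non-empty.

```python
from collections import Counter
from statistics import median

def select_documents(docs, lower_factor=0.5, upper_factor=1.5):
    """Keep concepts whose overall frequency lies in a band around the median.

    docs: mapping of document id -> list of concept strings.
    Returns (kept_concepts, selected_docs), where selected_docs maps each
    retained document id to the in-band concepts it contains.
    """
    # Overall frequency of occurrence of every concept across the corpus.
    freq = Counter(c for concepts in docs.values() for c in concepts)
    ordered = freq.most_common()  # concepts ordered by overall frequency

    # Assumption: the median of the counts stands in for the claim's
    # "median by document type"; edge conditions are set relative to it.
    mid = median(freq.values())
    lower, upper = lower_factor * mid, upper_factor * mid

    kept_concepts = {c for c, n in ordered if lower <= n <= upper}
    selected = {}
    for doc_id, concepts in docs.items():
        in_band = [c for c in concepts if c in kept_concepts]
        if in_band:  # documents with no in-band concept are dropped
            selected[doc_id] = in_band
    return kept_concepts, selected

corpus = {
    "a": ["x", "y", "z"], "b": ["x", "y", "z"], "c": ["x", "y", "w"],
    "d": ["x", "w"], "e": ["x", "r"], "f": ["x"],
}
kept, subset = select_documents(corpus)
print(sorted(kept), sorted(subset))
```

Here the concept `x` appears in every document, so it falls above the upper edge condition and is excluded. Document `f`, which contains only `x`, is therefore left out of the subset.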
15. A system for displaying stored cluster representations of document semantics, comprising:
a document store configured to store documents containing one or more concepts;
a text analyzer configured to order concepts parsed from the documents by overall frequencies of occurrence in the documents and further configured to identify those concepts having frequencies of occurrence that occur within upper and lower thresholds for concept frequencies;
an initial cluster module configured to choose sets of the identified concepts that match and further configured to create an arbitrary cluster for those of the documents corresponding to each set of concepts that match;
a further cluster module configured to place each remaining document that is not yet in a cluster, comprising:
a distance measuring submodule configured to determine Euclidean distances between each arbitrary cluster and an origin and the Euclidean distance between the remaining document and the origin;
a distance evaluation submodule configured to evaluate the Euclidean distances of the arbitrary clusters against the Euclidean distance of the remaining document; and
a document placer submodule configured to place the remaining document into the arbitrary cluster at minimal variance of five percent from the remaining document and into a new arbitrary cluster when the arbitrary clusters exceed the variance and the remaining document was not previously placed in one of the arbitrary clusters;
a visualization module configured to present the arbitrary clusters projected onto a two-dimensional display space; and
a processor configured to execute the modules.
Dependent claims: 16, 17, 18, 19
20. A method for displaying stored cluster representations of document semantics, comprising:
selecting concepts parsed from documents stored in a document store;
ordering the concepts by overall frequencies of occurrence in the documents and identifying those concepts having frequencies of occurrence that occur within upper and lower thresholds for concept frequencies;
choosing sets of the identified concepts that match and creating an arbitrary cluster for those of the documents corresponding to each set of concepts that match;
placing each remaining document that is not yet in a cluster, comprising:
determining Euclidean distances between each arbitrary cluster and an origin and the Euclidean distance between the remaining document and the origin;
evaluating the Euclidean distances of the arbitrary clusters against the Euclidean distance of the remaining document; and
placing the remaining document into the arbitrary cluster at minimal variance of five percent from the remaining document and into a new arbitrary cluster when the clusters exceed the variance and the remaining document was not previously placed in one of the clusters; and
presenting the arbitrary clusters projected onto a two-dimensional display space.
Dependent claims: 21, 22, 23, 24, 25
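The placement step of claim 20 compares each cluster's Euclidean distance from an origin with the remaining document's distance from that origin. One possible reading is sketched below. The data layout (a `center` vector per cluster), the interpretation of the five percent variance as a fraction of the cluster's distance to the origin, and the choice to seed a new cluster at the document's own vector are assumptions for illustration only.

```python
import math

def norm(vec):
    """Euclidean distance of a vector from the origin."""
    return math.sqrt(sum(v * v for v in vec))

def place_document(doc_vec, clusters, variance=0.05):
    """Place a document into the closest-matching cluster, or start a new one.

    clusters: list of dicts with "center" (vector) and "docs" (list of vectors).
    Returns the index of the cluster that received the document.
    """
    d_doc = norm(doc_vec)
    best, best_gap = None, None
    for i, cluster in enumerate(clusters):
        d_cluster = norm(cluster["center"])
        gap = abs(d_cluster - d_doc)
        # Assumption: "variance of five percent" is relative to the
        # cluster's own distance from the origin.
        if gap <= variance * d_cluster and (best_gap is None or gap < best_gap):
            best, best_gap = i, gap
    if best is None:
        # Every cluster exceeds the variance: form a new cluster.
        clusters.append({"center": list(doc_vec), "docs": [doc_vec]})
        return len(clusters) - 1
    clusters[best]["docs"].append(doc_vec)
    return best

clusters = [
    {"center": [3.0, 4.0], "docs": []},  # distance 5 from the origin
    {"center": [6.0, 8.0], "docs": []},  # distance 10 from the origin
]
for vec in ([3.0, 4.1], [0.0, 9.8], [1.0, 1.0]):
    print(vec, "->", place_document(vec, clusters))
```

Because only distances from the origin are compared, two vectors of similar length but different direction can land in the same cluster in this sketch. That follows from the reading of the claim adopted here, not from anything the source states about the intended behavior.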
Specification