
System and method for efficiently generating cluster groupings in a multi-dimensional concept space

  • US 8,402,026 B2
  • Filed: 08/03/2004
  • Issued: 03/19/2013
  • Est. Priority Date: 08/31/2001
  • Status: Active Grant
First Claim

1. A system for creating stored cluster representations of document semantics, comprising:

    a text analyzer configured to order concepts contained in documents selected from a document store by overall frequencies of occurrence to form a corpus of the documents;

    a document selection module configured to select a subset of the documents in the corpus that contain those concepts having frequencies of occurrence that occur within a bounded range of concept frequencies, comprising:

    a median determination submodule configured to set a median for the bounded range by document type;

    a bounded range determination submodule configured to establish upper and lower edge conditions of the bounded range relative to the median; and

    a selection submodule configured to select the documents that occur within the upper and lower edge conditions;

    a cluster module configured to assign the documents in the subset into clusters, comprising:

    an initial cluster submodule configured to group those documents from the subset that contain matching concepts into an arbitrary cluster for each of the matching concepts;

    a distance determination submodule configured to determine Euclidean distances between each of the arbitrary clusters and each remaining document that is not yet grouped into a cluster and to apply a variance of five percent to the Euclidean distances; and

    a secondary cluster submodule configured to place each remaining document into the arbitrary cluster for which the Euclidean distance between that remaining document and that arbitrary cluster falls within the variance;

    a cluster formation module configured to form a new arbitrary cluster for each remaining document that was not previously placed in one of the arbitrary clusters and that is associated with Euclidean distances that all fall outside the variance;

    a database configured to finalize and store the arbitrary clusters; and

    a processor configured to execute the modules.
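
The selection and clustering steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Every function name and the `spread` parameter are hypothetical, and several points the claim leaves open are filled with labeled assumptions: all documents are treated as a single document type, the edge conditions are taken as a fixed fraction above and below the median, the initial clusters are supplied by the caller as `seeds`, and "within the variance" is read as a distance to the cluster centroid of at most five percent of that centroid's length.

```python
import math
from statistics import median

VARIANCE = 0.05  # the five percent variance recited in the claim


def select_documents(docs, spread=0.5):
    """Keep documents containing concepts whose overall frequency lies in a
    bounded range around the median.

    docs maps a document id to {concept: frequency}. `spread` is a
    hypothetical parameter: edges are median*(1-spread) and median*(1+spread).
    Assumes a single document type, so only one median is computed.
    """
    totals = {}
    for freqs in docs.values():
        for concept, n in freqs.items():
            totals[concept] = totals.get(concept, 0) + n
    mid = median(totals.values())
    lower, upper = mid * (1 - spread), mid * (1 + spread)
    kept = {c for c, n in totals.items() if lower <= n <= upper}
    subset = {}
    for doc_id, freqs in docs.items():
        vec = {c: n for c, n in freqs.items() if c in kept}
        if vec:  # drop documents with no in-range concept
            subset[doc_id] = vec
    return subset


def norm(vec):
    return math.sqrt(sum(x * x for x in vec.values()))


def distance(a, b):
    """Euclidean distance between two sparse concept-frequency vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


def centroid(vectors):
    keys = set().union(*vectors)
    return {k: sum(v.get(k, 0) for v in vectors) / len(vectors) for k in keys}


def assign_remaining(seeds, remaining, docs):
    """Place each remaining document into an existing cluster or a new one.

    seeds: initial clusters as lists of document ids (assumed already built
    from matching concepts). Assumption: a document joins the nearest cluster
    whose centroid distance is at most VARIANCE times the centroid's length;
    otherwise it starts a new cluster.
    """
    clusters = [list(ids) for ids in seeds]
    for doc_id in remaining:
        vec = docs[doc_id]
        best = None
        for members in clusters:
            center = centroid([docs[m] for m in members])
            d = distance(vec, center)
            if d <= VARIANCE * norm(center) and (best is None or d < best[0]):
                best = (d, members)
        if best is not None:
            best[1].append(doc_id)
        else:
            clusters.append([doc_id])
    return clusters
```

In this sketch a caller would first pass the corpus through `select_documents`, build seed clusters from documents that share concepts, and then call `assign_remaining` with the ids that were left ungrouped. Recomputing the centroid on each pass keeps the code short; a larger corpus would cache it.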
