Method for organizing semi-structured data into a taxonomy, based on tag-separated clustering

  • US 7,502,765 B2
  • Filed: 12/21/2005
  • Issued: 03/10/2009
  • Est. Priority Date: 12/21/2005
  • Status: Active Grant
First Claim

1. A method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:

    retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;

    selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;

    clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;

    wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;

    wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;

    and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;

    ranking said plurality of clusters with respect to each selected structured and unstructured attribute type wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster;

    ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and

    outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.
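
The steps recited in claim 1 could be sketched roughly as follows. This is an illustrative reading only, not the patented implementation: every function name is hypothetical, equal-width binning stands in for the unspecified numerical clustering algorithm, "one cluster per frequent term" stands in for the unspecified monothetic algorithm, and ordering attributes by ascending entropy is an assumption, since the claim states only that the ranking is based on entropy.

```python
import math
from collections import Counter, defaultdict


def cluster_categorical(records, attr):
    """Categorical attribute: each category value forms a single cluster."""
    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        clusters[rec[attr]].append(i)
    return dict(clusters)


def cluster_numerical(records, attr, n_bins=3):
    """Numerical attribute: cluster values projected onto the attribute's range.

    Assumption: equal-width bins are used as a simple stand-in clusterer.
    """
    values = [rec[attr] for rec in records]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    clusters = defaultdict(list)
    for i, v in enumerate(values):
        bin_index = min(int((v - lo) / width), n_bins - 1)
        clusters[bin_index].append(i)
    return dict(clusters)


def cluster_text_monothetic(records, attr, top_k=5):
    """Text attribute: monothetic clusters, each defined by one term.

    Assumption: the defining terms are the top_k terms by document frequency.
    A document may fall into more than one cluster.
    """
    doc_terms = [set(rec[attr].lower().split()) for rec in records]
    doc_freq = Counter(term for terms in doc_terms for term in terms)
    clusters = {}
    for term, _ in doc_freq.most_common(top_k):
        clusters[term] = [i for i, terms in enumerate(doc_terms) if term in terms]
    return clusters


def rank_clusters_by_coverage(clusters):
    """Rank clusters by coverage, i.e. the number of data points in each."""
    return sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)


def entropy(clusters):
    """Shannon entropy (in bits) of the cluster-size distribution."""
    total = sum(len(members) for members in clusters.values())
    return -sum(
        (len(m) / total) * math.log2(len(m) / total)
        for m in clusters.values()
        if m
    )


def rank_attributes(clusterings):
    """Rank attributes relative to each other by entropy.

    Assumption: lower entropy ranks first; the claim does not fix a direction.
    """
    scored = [(attr, entropy(clusters)) for attr, clusters in clusterings.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical sample records mixing structured and unstructured fields.
    records = [
        {"product": "laptop", "price": 900.0, "text": "battery drains fast"},
        {"product": "laptop", "price": 1200.0, "text": "screen flickers"},
        {"product": "phone", "price": 300.0, "text": "battery swells"},
        {"product": "laptop", "price": 400.0, "text": "battery dead"},
    ]
    clusterings = {
        "product": cluster_categorical(records, "product"),
        "price": cluster_numerical(records, "price"),
        "text": cluster_text_monothetic(records, "text", top_k=1),
    }
    for attr, score in rank_attributes(clusterings):
        print(attr, round(score, 3), rank_clusters_by_coverage(clusterings[attr]))
```

In this sketch the taxonomy presented to a user would list attributes in the order returned by `rank_attributes`, and under each attribute its clusters in the order returned by `rank_clusters_by_coverage`.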
