Method for organizing semi-structured data into a taxonomy, based on tag-separated clustering

US 7,502,765 B2
Filed: 12/21/2005
Issued: 03/10/2009
Est. Priority Date: 12/21/2005
Status: Active Grant

First Claim

1. A method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:

retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;

selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;

clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;

wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;

wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;

and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;

ranking said plurality of clusters with respect to each selected structured and unstructured attribute type wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster;

ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and

outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.

Abstract

A method organizes semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering. The method comprises retrieving documents including the semi-structured data. The semi-structured data comprises structured data including structured data fields and tags, and unstructured data. The method selects a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute. The method clusters the semi-structured data from the retrieved documents into a plurality of clusters based on the selected structured attribute type and the selected unstructured attribute type. For a categorical attribute, each category corresponds to a single cluster. For a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of the numerical attribute. For an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for the annotated text data.
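
The per-attribute clustering rules in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the dictionary-based document layout, and the use of equal-width bins as the numerical "clustering algorithm" are all assumptions made for illustration, since the text does not name a specific algorithm.

```python
from collections import defaultdict


def cluster_categorical(docs, attr):
    """One cluster per distinct category value of `attr`."""
    clusters = defaultdict(list)
    for doc_id, doc in docs.items():
        if attr in doc:
            clusters[doc[attr]].append(doc_id)
    return dict(clusters)


def cluster_numerical(docs, attr, n_bins=3):
    """Cluster values projected onto the observed range of `attr`.

    Equal-width binning is an assumed stand-in for whatever
    one-dimensional clustering algorithm is actually used.
    """
    values = {d: doc[attr] for d, doc in docs.items() if attr in doc}
    if not values:
        return {}
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values match
    clusters = defaultdict(list)
    for doc_id, value in values.items():
        # Clamp so the maximum value falls into the last bin.
        idx = min(int((value - lo) / width), n_bins - 1)
        clusters[idx].append(doc_id)
    return dict(clusters)


# Hypothetical documents with one categorical and one numerical field.
docs = {
    1: {"brand": "A", "price": 10},
    2: {"brand": "B", "price": 20},
    3: {"brand": "A", "price": 100},
}
by_brand = cluster_categorical(docs, "brand")
by_price = cluster_numerical(docs, "price")
```

Here `by_brand` groups documents 1 and 3 under brand `A`, while `by_price` places the two low-priced documents in the first bin and the expensive one in the last.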

12 Claims

1. A method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:
- retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
  
  selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
  
  clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;
  
  wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;
  
  wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;
  
  and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;
  
  ranking said plurality of clusters with respect to each selected structured and unstructured attribute type wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster;
  
  ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and
  
  outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.
- Dependent claims (2, 11):
  - 2. The method of claim 1, further comprising representing said taxonomy as a hierarchical tree structure comprising a root node and a plurality of child nodes, said root node containing said semi-structured data and each of said child nodes containing data points of a cluster generated from said semi-structured data.
  - 11. The method of claim 1, wherein said monothetic clustering algorithm generates single level labeled clusters within each attribute, assign documents to a cluster based on a single feature, identify a set of concepts present in each of a collection of documents, wherein said concepts comprise words that appear in said documents and phrases extracted from said documents by natural language processing, and select subsets of said concepts, wherein said subsets are labels of said clusters, and assign documents containing a concept to a cluster comprising said concept as its label.
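
The monothetic clustering described in claim 11 can be sketched roughly as follows. This is an illustrative approximation only: concepts are reduced to lowercase words (no phrase extraction), and the greedy choice of labels by how many still-unlabeled documents they cover is an assumption, as the claim does not say how the subset of concepts is selected. The name `monothetic_clusters` and its parameters are hypothetical.

```python
def monothetic_clusters(docs, max_labels=3):
    """Single-level labeled clusters: each cluster is defined by one concept."""
    # Identify the concepts (here: plain words) present in each document.
    concept_docs = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            concept_docs.setdefault(word, set()).add(doc_id)

    uncovered = set(docs)
    labels = []
    # Assumed selection rule: greedily pick the concept covering the most
    # uncovered documents; ties are broken alphabetically.
    while uncovered and concept_docs and len(labels) < max_labels:
        best = max(sorted(concept_docs),
                   key=lambda c: len(concept_docs[c] & uncovered))
        if not concept_docs[best] & uncovered:
            break
        labels.append(best)
        uncovered -= concept_docs[best]

    # Every document containing a selected concept joins that concept's cluster.
    return {label: sorted(concept_docs[label]) for label in labels}


# Hypothetical text attribute values keyed by document id.
tickets = {
    1: "printer jam",
    2: "printer offline",
    3: "battery drain",
    4: "battery swollen",
}
clusters = monothetic_clusters(tickets)
```

Because membership depends on a single feature (the label concept), a document may appear in more than one cluster when it contains several selected concepts.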

3. A method for organizing semi-structured data into a taxonomy based on Tag-Mixed (TM) clustering, said method comprising:
- retrieving data samples including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
  
  generating a vocabulary of items from said semi-structured data based on a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
  
  adding all possible values of one or more structured attribute types and said unstructured attribute type to said generated vocabulary, wherein each item of said generated vocabulary comprises a set of tokens corresponding to a data sample that is part of said generated vocabulary;
  
  initializing an inverted index for said generated vocabulary;
  
  for each said data sample, determining said structured and unstructured attribute type and adding said set of tokens associated with each of said structured and unstructured attribute types to said inverted index;
  
  clustering said semi-structured data by applying monothetic clustering to said items in said generated vocabulary and said data samples, corresponding to one or more tokens, of said semi-structured data that comprise said items to provide a plurality of clusters;
  
  ranking said plurality of clusters with respect to each said structured attribute type and said unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster; and
  
  outputting data samples, based on said monothetic clustering of said items and said data samples, as said organizing to a user.
- Dependent claims (4, 5, 6, 7, 12):
  - 4. The method of claim 3, wherein for each numerical attribute:
    - clustering said semi-structured data based on said numerical attribute; and
      
      treating said numerical attribute based cluster as a categorical attribute; and
      
      adding all possible values of each of said plurality of numerical attributes to said vocabulary.
  - 5. The method of claim 3, wherein said generating a vocabulary of items comprises:
    - for each categorical attribute, adding attribute value pairs corresponding to all possible values of the categorical attribute to the vocabulary;
      
      for each numerical attribute, clustering the data based on the numerical attribute, considering the numerical-attribute-based clusters as categorical attributes and adding corresponding attribute value pairs to the vocabulary;
      
      for each textual attribute, extracting all possible words or phrases occurring in the values of the textual attribute and adding corresponding attribute word or phrase pairs to the vocabulary; and
      
      for each annotated textual attribute, adding all possible attribute tagged text tuplets to the vocabulary.
  - 6. The method of claim 3, wherein said clustering is performed using a monothetic clustering algorithm based on each of said attribute types.
  - 7. The method of claim 3, further comprising representing said taxonomy as a hierarchical tree structure comprising a root node and a plurality of child nodes, said root node containing said semi-structured data and each of said child nodes containing data points of a cluster generated from said semi-structured data.
  - 12. The method of claim 3, wherein said monothetic clustering algorithm generates single level labeled clusters within each attribute, assign documents to a cluster based on a single feature, identify a set of concepts present in each of a collection of documents, wherein said concepts comprise words that appear in said documents and phrases extracted from said documents by natural language processing, and select subsets of said concepts, wherein said subsets are labels of said clusters, and assign documents containing a concept to a cluster comprising said concept as its label.
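
The vocabulary and inverted index steps of claims 3 and 5 can be sketched in Python. This is a simplified illustration under stated assumptions: vocabulary items are modeled as `(attribute, value)` tuples, text attributes are split on whitespace rather than processed for phrases, and the helper names `tokens_for` and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def tokens_for(sample, text_attrs):
    """Return the set of (attribute, value) tokens for one data sample."""
    tokens = set()
    for attr, value in sample.items():
        if attr in text_attrs:
            # Text attribute: one (attribute, word) pair per distinct word.
            tokens.update((attr, word) for word in value.lower().split())
        else:
            # Categorical attribute: a single (attribute, value) pair.
            tokens.add((attr, value))
    return tokens


def build_inverted_index(samples, text_attrs):
    """Map every vocabulary item to the ids of the samples containing it."""
    index = defaultdict(set)
    for sample_id, sample in samples.items():
        for token in tokens_for(sample, text_attrs):
            index[token].add(sample_id)
    return dict(index)


# Hypothetical data samples mixing a categorical and a text attribute.
samples = {
    1: {"brand": "A", "notes": "screen cracked"},
    2: {"brand": "A", "notes": "screen dim"},
}
index = build_inverted_index(samples, text_attrs={"notes"})
vocabulary = set(index)
```

Since items from all attribute types share one vocabulary, a monothetic clustering pass over `index` can pick labels such as `("brand", "A")` and `("notes", "screen")` side by side, which is what distinguishes the mixed approach from clustering each attribute separately.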

8. A program storage device readable by machine, tangibly embodying a program of instructions executable by said machine to perform a method for organizing semi-structured data into a taxonomy, based on Tag-Mixed (TM) clustering, said method comprising:
- retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
  
  selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
  
  clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;
  
  wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;
  
  wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;
  
  and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;
  
  ranking said plurality of clusters with respect to each selected structured and unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster;
  
  ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and
  
  outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.
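
The two ranking steps recited above (clusters by coverage, attribute types by entropy) can be sketched as follows. The claims do not state whether higher or lower entropy ranks first, so the `descending` flag below is an assumption, as are the function names and the use of Shannon entropy over cluster sizes.

```python
import math


def rank_clusters_by_coverage(clusters):
    """Order cluster labels by number of data points, largest first."""
    return sorted(clusters, key=lambda label: (-len(clusters[label]), str(label)))


def entropy(clusters):
    """Shannon entropy (in bits) of the distribution of points over clusters."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    h = 0.0
    for members in clusters.values():
        p = len(members) / total
        if p > 0:
            h -= p * math.log2(p)
    return h


def rank_attributes(clusterings, descending=True):
    """Order attribute names by the entropy of their clusterings.

    The ranking direction is an assumption; the claim only says that
    attribute types are ranked based on entropy.
    """
    return sorted(clusterings, key=lambda a: entropy(clusterings[a]),
                  reverse=descending)


# Hypothetical clusterings of four documents under two attributes.
clusterings = {
    "brand": {"A": [1, 2, 3], "B": [4]},
    "color": {"red": [1, 2], "blue": [3, 4]},
}
attribute_order = rank_attributes(clusterings)
brand_order = rank_clusters_by_coverage(clusterings["brand"])
```

With these inputs the evenly split `color` attribute has entropy of 1 bit and the skewed `brand` attribute has lower entropy, so the order of the two depends entirely on the chosen direction.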

9. A program storage device readable by machine, tangibly embodying a program of instructions executable by said machine to perform a method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:
- retrieving data samples including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
  
  generating a vocabulary of items from said semi-structured data based on a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and unstructured attribute type including a text attribute;
  
  adding all possible values of one or more structured attribute types and said unstructured attribute type to said generated vocabulary, wherein each item of said generated vocabulary comprises a set of tokens corresponding to a data sample that is part of said generated vocabulary;
  
  initializing an inverted index for said generated vocabulary;
  
  for each said data sample, determining said structured and unstructured attribute type and adding said set of tokens associated with each of said structured and unstructured attribute types to said inverted index;
  
  clustering said semi-structured data by applying monothetic clustering to said items in said generated vocabulary and said data samples, corresponding to one or more tokens, of said semi-structured data that comprise said items to provide a plurality of clusters;
  
  ranking said plurality of clusters with respect to each said structured attribute type and said unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster; and
  
  outputting data samples, based on said monothetic clustering of said items and said data samples, as said organizing to a user.
- Dependent claims (10):
  - 10. The computer program product of claim 9, wherein for each numerical attribute:
    - clustering said semi-structured data based on said numerical attribute; and
      
      treating said numerical attribute based cluster as a categorical attribute; and
      
      adding all possible values of each of said plurality of numerical attributes to said vocabulary.
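
The treatment of numerical attributes in claims 4 and 10 (cluster the values, treat each cluster as a category, add the resulting values to the vocabulary) can be illustrated with a short Python sketch. Splitting the sorted values at their largest gaps is an assumed clustering rule chosen for brevity; the claims do not prescribe one. The function name `numeric_to_items` and the range-style labels are hypothetical.

```python
def numeric_to_items(samples, attr, n_clusters=2):
    """Turn a numerical attribute into categorical (attribute, range) items."""
    values = sorted({s[attr] for s in samples.values() if attr in s})
    if not values:
        return {}, set()

    # Assumed rule: cut the sorted values at the (n_clusters - 1) largest gaps.
    gap_positions = sorted(range(1, len(values)),
                           key=lambda i: values[i] - values[i - 1],
                           reverse=True)
    cuts = sorted(gap_positions[:n_clusters - 1])
    bounds = [0] + cuts + [len(values)]
    ranges = [(values[a], values[b - 1]) for a, b in zip(bounds, bounds[1:])]

    def label(value):
        # Each numeric cluster is treated as one categorical value.
        for lo, hi in ranges:
            if lo <= value <= hi:
                return f"{lo}-{hi}"

    items = {sid: (attr, label(s[attr]))
             for sid, s in samples.items() if attr in s}
    return items, set(items.values())


# Hypothetical samples with a single numerical attribute.
samples = {1: {"price": 10}, 2: {"price": 12}, 3: {"price": 95}, 4: {"price": 100}}
items, vocabulary = numeric_to_items(samples, "price")
```

The resulting items, such as `("price", "10-12")`, have the same shape as ordinary categorical attribute-value pairs and can be added to the shared vocabulary unchanged.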

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Kummamuru, Krishna; Kankar, Pankaj
Primary Examiner(s): Vincent, David Robert
Assistant Examiner(s): Berman, Melissa
Application Number: US11/314,596
Publication Number: US 20070143235A1
Time in Patent Office: 1,175 Days
Field of Search: 706/20, 706/15, 703/14, 703/19
US Class Current: 706/15
CPC Class Codes:
G06F 16/81 Indexing, e.g. XML tags; Da...
G06F 18/231 Hierarchical techniques, i....
