Method for organizing semi-structured data into a taxonomy, based on tag-separated clustering
Abstract
A method organizes semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering. The method comprises retrieving documents including the semi-structured data. The semi-structured data comprises structured data including structured data fields and tags, and unstructured data. The method selects a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute. The method clusters the semi-structured data from the retrieved documents into a plurality of clusters based on the selected structured attribute type and the selected unstructured attribute type. For a categorical attribute, each category corresponds to a single cluster. For a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of the numerical attribute. For an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for the annotated text data.
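The per-attribute clustering described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the record layout, the field names `brand` and `price`, and the helper functions are not taken from the patent. The abstract does not say which clustering algorithm is used for numerical data, so equal-width binning over the observed range is substituted as a simple stand-in.

```python
from collections import defaultdict


def cluster_categorical(records, field):
    """One cluster per distinct category value; records lacking the field are skipped."""
    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        if field in rec:
            clusters[rec[field]].append(i)
    return dict(clusters)


def cluster_numerical(records, field, n_bins=3):
    """Equal-width binning over the attribute's observed range.

    Assumption: this is only a stand-in for the unspecified numerical
    clustering algorithm mentioned in the abstract.
    """
    values = [(i, rec[field]) for i, rec in enumerate(records) if field in rec]
    if not values:
        return {}
    lo = min(v for _, v in values)
    hi = max(v for _, v in values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    clusters = defaultdict(list)
    for i, v in values:
        bin_id = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        clusters[bin_id].append(i)
    return dict(clusters)


# Hypothetical sample records mixing a categorical and a numerical field.
records = [
    {"brand": "A", "price": 10.0},
    {"brand": "B", "price": 12.0},
    {"brand": "A", "price": 55.0},
    {"brand": "A", "price": 100.0},
]
by_brand = cluster_categorical(records, "brand")
by_price = cluster_numerical(records, "price", n_bins=3)
```

Each attribute is clustered separately, which is what distinguishes the tag-separated variant from the tag-mixed one, where tokens from all attributes share a single vocabulary.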
12 Claims
1. A method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:
retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;
wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;
wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;
and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;
ranking said plurality of clusters with respect to each selected structured and unstructured attribute type wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster;
ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and
outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.
Dependent claims: 2, 11.
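The two ranking steps recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names and sample clusterings are hypothetical, entropy is computed as Shannon entropy over cluster sizes, and attributes are ordered lowest entropy first. The claim says only that attribute types are ranked "based on entropy" and does not fix the direction.

```python
import math


def rank_clusters_by_coverage(clusters):
    """Order cluster labels by coverage, i.e. number of data points, largest first."""
    return sorted(clusters, key=lambda label: len(clusters[label]), reverse=True)


def entropy(clusters):
    """Shannon entropy (in bits) of the cluster-size distribution for one attribute."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    h = 0.0
    for members in clusters.values():
        p = len(members) / total
        if p > 0:
            h -= p * math.log2(p)
    return h


def rank_attributes_by_entropy(clusterings):
    """Order attribute names by entropy, lowest first (the direction is an assumption)."""
    return sorted(clusterings, key=lambda attr: entropy(clusterings[attr]))


# Hypothetical clusterings: attribute name -> {cluster label: member indices}.
clusterings = {
    "price_bin": {0: [0, 1], 1: [2], 2: [3]},
    "brand": {"A": [0, 2, 3], "B": [1]},
}
attribute_order = rank_attributes_by_entropy(clusterings)
brand_order = rank_clusters_by_coverage(clusterings["brand"])
```

An output step could then walk `attribute_order` and, within each attribute, present clusters in coverage order.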
3. A method for organizing semi-structured data into a taxonomy based on Tag-Mixed (TM) clustering, said method comprising:
retrieving data samples including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
generating a vocabulary of items from said semi-structured data based on a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
adding all possible values of one or more structured attribute types and said unstructured attribute type to said generated vocabulary, wherein each item of said generated vocabulary comprises a set of tokens corresponding to a data sample that is part of said generated vocabulary;
initializing an inverted index for said generated vocabulary;
for each said data sample, determining said structured and unstructured attribute type and adding said set of tokens associated with each of said structured and unstructured attribute types to said inverted index;
clustering said semi-structured data by applying monothetic clustering to said items in said generated vocabulary and said data samples, corresponding to one or more tokens, of said semi-structured data that comprise said items to provide a plurality of clusters;
ranking said plurality of clusters with respect to each said structured attribute type and said unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster; and
outputting data samples, based on said monothetic clustering of said items and said data samples, as said organizing to a user.
Dependent claims: 4, 5, 6, 7, 12.
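The vocabulary and inverted-index steps of claim 3 can be sketched roughly as below. All names are hypothetical, as is the token format (`field=value` for structured fields, `text:word` for free text). The claim does not name a specific monothetic clustering algorithm, so a greedy selection of the single token covering the most not-yet-covered samples is used as a stand-in.

```python
from collections import defaultdict


def tokens_for(sample):
    """Turn one data sample into tag-qualified tokens (the token format is an assumption)."""
    tokens = set()
    for field, value in sample.items():
        if field == "text":
            tokens.update(f"text:{word}" for word in value.lower().split())
        else:
            tokens.add(f"{field}={value}")
    return tokens


def build_inverted_index(samples):
    """Map every vocabulary token to the set of sample indices that contain it."""
    index = defaultdict(set)
    for i, sample in enumerate(samples):
        for token in tokens_for(sample):
            index[token].add(i)
    return index


def greedy_monothetic(index, k):
    """Pick up to k single-token clusters, each chosen to cover the most uncovered samples.

    Assumption: a simple greedy stand-in for the unspecified monothetic algorithm.
    Ties are broken by token string so that the result is deterministic.
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(index, key=lambda t: (len(index[t] - covered), t), default=None)
        if best is None or not (index[best] - covered):
            break
        chosen.append((best, sorted(index[best])))
        covered |= index[best]
    return chosen


# Hypothetical samples mixing a structured field with free text.
samples = [
    {"brand": "A", "text": "cheap phone"},
    {"brand": "A", "text": "cheap laptop"},
    {"brand": "B", "text": "fast laptop"},
    {"brand": "A", "text": "fast phone"},
]
index = build_inverted_index(samples)
tm_clusters = greedy_monothetic(index, k=2)
```

Because structured values and text words share one index, a cluster label may come from either kind of attribute, which is the sense in which the tags are mixed.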
8. A program storage device readable by machine, tangibly embodying a program of instructions executable by said machine to perform a method for organizing semi-structured data into a taxonomy, based on Tag-Mixed (TM) clustering, said method comprising:
retrieving documents including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
selecting a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and an unstructured attribute type including a text attribute;
clustering said semi-structured data from said retrieved documents into a plurality of clusters based on said selected structured attribute type, wherein for a categorical attribute, each category corresponds to a single cluster;
wherein for a numerical attribute, a clustering algorithm clusters numerical data projected onto a range of said numerical attribute;
wherein for an annotated text attribute, a monothetic clustering algorithm clusters annotated text data according to tags associated with a vocabulary for said annotated text data;
and said selected unstructured attribute type, wherein for said text attribute, a monothetic clustering algorithm clusters text data with respect to said text attribute;
ranking said plurality of clusters with respect to each selected structured and unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion coverage provided by a number of data points in a cluster;
ranking said selected structured and unstructured attribute types relative to each other based on a ranking measure, wherein said selected structured and unstructured attribute types are ranked based on entropy of corresponding data for each selected structured and unstructured attribute type; and
outputting documents, based on said ranking measure and said ranking said plurality of clusters, as said organizing to a user.
9. A program storage device readable by machine, tangibly embodying a program of instructions executable by said machine to perform a method for organizing semi-structured data into a taxonomy, based on Tag-Separated (TS) clustering, said method comprising:
retrieving data samples including said semi-structured data, said semi-structured data comprising structured data including structured data fields and tags, and unstructured data;
generating a vocabulary of items from said semi-structured data based on a structured attribute type including any of a categorical attribute, a numerical attribute, and a tag associated with annotated text, and unstructured attribute type including a text attribute;
adding all possible values of one or more structured attribute types and said unstructured attribute type to said generated vocabulary, wherein each item of said generated vocabulary comprises a set of tokens corresponding to a data sample that is part of said generated vocabulary;
initializing an inverted index for said generated vocabulary;
for each said data sample, determining said structured and unstructured attribute type and adding said set of tokens associated with each of said structured and unstructured attribute types to said inverted index;
clustering said semi-structured data by applying monothetic clustering to said items in said generated vocabulary and said data samples, corresponding to one or more tokens, of said semi-structured data that comprise said items to provide a plurality of clusters;
ranking said plurality of clusters with respect to each said structured attribute type and said unstructured attribute type, wherein clusters of said plurality of clusters are ranked based on a criterion comprising coverage provided by a number of data points in a cluster; and
outputting data samples, based on said monothetic clustering of said items and said data samples, as said organizing to a user.
Dependent claims: 10.
Specification