Majority schema in semi-structured data
Abstract
A schema discovery system and associated method discover a majority schema for a set of related, similarly marked-up documents, such as HTML documents. The approach assumes that, although the structure of these documents serves mostly visual purposes, the keywords used in the documents, together with the structural tags, provide hints that allow a rough sketch of the underlying intended schema. Assuming further that the HTML documents are closely related in content even though they are marked up differently because of diverse authoring skills, it is reasonable to look for a schema that unifies these different schemas and is shared by the majority of the documents. The system employs constraint rules on tree ordering to reduce the computational complexity of arriving at an optimized XML DTD schema. These generalized XML DTD schemas may be used for automated comparison and evaluation of profile documents on the WWW.
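The idea of a schema "shared by the majority" of the documents can be illustrated with a minimal sketch. It is not the patented implementation. The `(label, children)` tuple shape, the function names, the resume-style sample labels, and the 0.5 support threshold are all assumptions made for this example.

```python
from collections import Counter

# Assumed tree shape for illustration: a node is (label, [child nodes]).
def label_paths(node, prefix=()):
    """Return the set of root-to-node label paths found in one ordered tree."""
    label, children = node
    path = prefix + (label,)
    paths = {path}
    for child in children:
        paths |= label_paths(child, path)
    return paths

def frequent_label_paths(trees, min_support=0.5):
    """Keep the label paths that occur in more than min_support of the trees."""
    counts = Counter()
    for tree in trees:
        # label_paths returns a set, so each document counts a path at most once.
        counts.update(label_paths(tree))
    threshold = min_support * len(trees)
    return {path for path, count in counts.items() if count > threshold}

# Hypothetical profile documents that are related but marked up differently.
docs = [
    ("resume", [("education", [("degree", [])]), ("experience", [])]),
    ("resume", [("education", []), ("skills", [])]),
    ("resume", [("experience", []), ("education", [("degree", [])])]),
]
majority = frequent_label_paths(docs)
```

Here `("resume", "skills")` appears in only one of the three documents and is dropped, while the paths through `education`, `degree`, and `experience` each appear in at least two and are kept.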
9 Claims
1. A method for discovering a majority schema from a set of related documents that share similar schemas, comprising:
extracting a set of schematic structures of the documents;
converting the schematic structures to sets of label paths;
discovering a set of frequent label paths from amongst the sets of label paths;
unifying similar schematic structures of the documents based on the set of frequent label paths that represents a majority schema;
expressing the majority schema in a predefined language;
wherein extracting schematic structures of the documents includes representing the schematic structures as sets of ordered trees with nodes labeled by a set of keywords;
wherein extracting schematic structures includes acquiring XML documents;
wherein extracting schematic structures includes placing title keywords and content keywords in ordered trees according to a specified depth;
wherein discovering a set of frequent label paths includes introducing a constraint mechanism to specify a restriction on the schematic structures in the majority schema, to help reduce noise and to improve efficiency; and
wherein discovering a set of frequent label paths further includes discovering a set of frequent label paths satisfying the constraint mechanism.
(Dependent claims: 2, 3, 4, 5)
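The last steps of claim 1, restricting the frequent label paths with a constraint and expressing the result in a predefined language, might be sketched as follows. This is an illustration under stated assumptions, not the claimed method. The `satisfies` constraint (a depth bound plus a set of banned labels) is hypothetical, and the choice of marking every child as `*` in the generated XML DTD is a simplification made for this example.

```python
def satisfies(path, max_depth=3, banned=frozenset()):
    """Hypothetical constraint: bounded path depth and no banned labels."""
    return len(path) <= max_depth and not (set(path) & banned)

def to_dtd(paths):
    """Emit one <!ELEMENT> declaration per label found in the label paths."""
    children = {}
    # Sorting makes the output deterministic; sibling order is therefore
    # alphabetical here, not the order observed in the source documents.
    for path in sorted(paths):
        children.setdefault(path[-1], [])
        if len(path) > 1:
            siblings = children.setdefault(path[-2], [])
            if path[-1] not in siblings:
                siblings.append(path[-1])
    lines = []
    for label, kids in children.items():
        # Leaves are declared as text; inner nodes list optional, repeatable children.
        model = "(" + ", ".join(k + "*" for k in kids) + ")" if kids else "(#PCDATA)"
        lines.append(f"<!ELEMENT {label} {model}>")
    return "\n".join(lines)

# Hypothetical frequent label paths, written as tuples of labels.
frequent = {
    ("resume",),
    ("resume", "education"),
    ("resume", "education", "degree"),
    ("resume", "experience"),
}
constrained = {p for p in frequent if satisfies(p, max_depth=2)}
dtd = to_dtd(constrained)
```

With `max_depth=2` the `degree` path is filtered out before the DTD is generated, which shows how a constraint can shrink the candidate set. A DTD declares each element name once, so the same label under different parents would share one declaration in this sketch.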
6. A computer program product for discovering a majority schema from a set of related documents that share similar schemas, comprising:
a schema discovery system for extracting a set of schematic structures of the documents;
the schema discovery system converting the schematic structures to sets of label paths;
the schema discovery system discovering a set of frequent label paths from amongst the sets of label paths;
the schema discovery system unifying similar schematic structures of the documents based on the set of frequent label paths; and
the schema discovery system expressing the set of frequent label paths in a predefined language;
wherein the schema discovery system extracts the set of schematic structures of the documents by representing the schematic structures as sets of ordered trees with nodes labeled by a set of keywords;
wherein the schematic structures include XML documents;
wherein the schematic structures are extracted by placing title keywords and content keywords in ordered trees according to a specified depth;
wherein the schema discovery system discovers a set of frequent label paths by introducing a constraint mechanism to specify a restriction on the schematic structures in the majority schema, to help reduce noise and to improve efficiency; and
wherein the schema discovery system further discovers a set of frequent label paths satisfying the constraint mechanism.
(Dependent claims: 7, 8, 9)
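For the extraction step, where XML documents are represented as ordered trees limited to a specified depth, a rough sketch using Python's standard `xml.etree.ElementTree` module could look like this. Treating element tags as the node labels is an assumption of this example, since the claim speaks of title keywords and content keywords. The function name `ordered_tree` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def ordered_tree(elem, max_depth, depth=1):
    """Copy element tags into a (label, children) tree, truncated at max_depth."""
    if depth >= max_depth:
        # Anything nested deeper than the specified depth is cut off.
        return (elem.tag, [])
    # Iterating an Element yields its child elements in document order.
    return (elem.tag, [ordered_tree(child, max_depth, depth + 1) for child in elem])

# Hypothetical input document.
xml_text = "<resume><education><degree>BSc</degree></education><skills/></resume>"
tree = ordered_tree(ET.fromstring(xml_text), max_depth=2)
```

With `max_depth=2` the nested `degree` element is left out, so the resulting tree contains only `resume` and its two direct children in their original order.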
Specification