Majority schema in semi-structured data

US 6,604,099 B1
Filed: 07/27/2000
Issued: 08/05/2003
Est. Priority Date: 03/20/2000
Status: Expired due to Term

Abstract

A schema discovery system and associated method discover a majority schema for a set of related, similarly marked-up documents, such as HTML documents. The approach rests on the assumption that, although the structure of these documents serves mostly visual purposes, the keywords used in the documents, together with the structural tags, provide hints that allow a rough sketch of the underlying intended schema. Assuming further that the HTML documents are closely related in content even though they are marked up differently because of diverse authoring skills, it is reasonable to look for a single schema that unifies these different schemas and is shared by the majority of the documents. The system employs constraint rules on tree ordering to reduce the computational complexity of arriving at an optimized XML DTD schema. These generalized XML DTD schemas may be used to perform automated comparison and evaluation of profile documents on the WWW.
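The core idea of the abstract, turning each document's keyword tree into label paths and keeping the paths shared by most documents, can be illustrated with a minimal Python sketch. It is not the patented implementation: the `(label, children)` tuple representation, the function names, the `min_support` threshold, and the `constraint` predicate are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical representation: an ordered tree is (label, [child trees]),
# where each label is a keyword taken from the document.

def label_paths(tree, prefix=()):
    """Return the set of root-to-node label paths of one tree.

    Using a set means a path repeated inside one document counts once.
    """
    label, children = tree
    path = prefix + (label,)
    paths = {path}
    for child in children:
        paths |= label_paths(child, path)
    return paths

def frequent_label_paths(trees, min_support=0.5, constraint=lambda p: True):
    """Keep paths that satisfy `constraint` and occur in at least
    `min_support` (a fraction) of the trees."""
    counts = Counter()
    for tree in trees:
        counts.update(p for p in label_paths(tree) if constraint(p))
    return {p for p, c in counts.items() if c / len(trees) >= min_support}

# Three made-up "resume" documents with slightly different structure.
docs = [
    ("resume", [("education", [("degree", [])]), ("experience", [])]),
    ("resume", [("education", []), ("skills", [])]),
    ("resume", [("experience", []), ("education", [("degree", [])])]),
]

frequent = frequent_label_paths(docs, min_support=0.6)
# An assumed constraint: ignore paths deeper than two labels.
shallow = frequent_label_paths(docs, min_support=0.6,
                               constraint=lambda p: len(p) <= 2)
```

Here `("resume", "skills")` appears in only one of three documents and is dropped, while the constraint in the second call additionally prunes the `degree` path before any counting takes place.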

9 Claims

1. A method for discovering a majority schema from a set of related documents that share similar schemas, comprising:
- extracting a set of schematic structures of the documents;
  
  converting the schematic structures to sets of label paths;
  
  discovering a set of frequent label paths from amongst the sets of label paths;
  
  unifying similar schematic structures of the documents based on the set of frequent label paths that represents a majority schema;
  
  expressing the majority schema in a predefined language;
  
  wherein extracting schematic structures of the documents includes representing the schematic structures as sets of ordered trees with nodes labeled by a set of keywords;
  
  wherein extracting schematic structures includes acquiring XML documents;
  
  wherein extracting schematic structures includes placing title keywords and content keywords in ordered trees according to a specified depth;
  
  wherein discovering a set of frequent label paths includes introducing a constraint mechanism to specify a restriction on the schematic structures in the majority schema, to help reduce noise and to improve efficiency; and
  
  wherein discovering a set of frequent label paths further includes discovering a set of frequent label paths satisfying the constraint mechanism.
- Dependent claims (2, 3, 4, 5):
  - 2. The method according to claim 1, wherein extracting schematic structures includes using reordering rules to reconfigure the trees.
  - 3. The method according to claim 1, wherein converting the schematic structures to sets of label paths includes mapping ignoring repetitive information.
  - 4. The method according to claim 1, wherein unifying similar schematic structures includes using a clustering approach based on a predetermined tree distance.
  - 5. The method according to claim 1, wherein converting the set of frequent label paths to a predefined structured schema includes converting the set of frequent label paths to an XML DTD schema.
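Claim 4 refers to clustering based on a predetermined tree distance without defining that distance in the claim text. As a rough illustration only, the Python sketch below substitutes a Jaccard distance over label-path sets and a simple greedy grouping. Both choices, along with the names `path_distance`, `cluster`, and `max_distance`, are assumptions and are not taken from the patent.

```python
def path_distance(paths_a, paths_b):
    """Jaccard distance between two sets of label paths (a stand-in
    for the claim's unspecified 'predetermined tree distance')."""
    union = paths_a | paths_b
    if not union:
        return 0.0
    return 1.0 - len(paths_a & paths_b) / len(union)

def cluster(path_sets, max_distance):
    """Greedy grouping: put each document in the first cluster whose
    first member is within `max_distance`, else start a new cluster.
    Returns lists of document indices."""
    clusters = []
    for i, paths in enumerate(path_sets):
        for members in clusters:
            if path_distance(paths, path_sets[members[0]]) <= max_distance:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Made-up label-path sets for three documents.
a = {("resume",), ("resume", "education")}
b = {("resume",), ("resume", "education"), ("resume", "skills")}
c = {("recipe",), ("recipe", "ingredients")}

groups = cluster([a, b, c], max_distance=0.5)
```

With these inputs the two resume-like documents share two of three paths (distance 1/3) and fall into one group, while the unrelated document forms its own.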

6. A computer program product for discovering a majority schema from a set of related documents that share similar schemas, comprising:
- a schema discovery system for extracting a set of schematic structures of the documents;
  
  the schema discovery system converting the schematic structures to sets of label paths;
  
  the schema discovery system discovering a set of frequent label paths from amongst the sets of label paths;
  
  the schema discovery system unifying similar schematic structures of the documents based on the set of frequent label paths; and
  
  the schema discovery system expressing the set of frequent label paths in a predefined language;
  
  wherein the schema discovery system extracts the set of schematic structures of the documents by representing the schematic structures as sets of ordered trees with nodes labeled by a set of keywords;
  
  wherein the schematic structures include XML documents;
  
  wherein the schematic structures are extracted by placing title keywords and content keywords in ordered trees according to a specified depth;
  
  wherein the schema discovery system discovers a set of frequent label paths by introducing a constraint mechanism to specify a restriction on the schematic structures in the majority schema, to help reduce noise and to improve efficiency; and
  
  wherein the schema discovery system further discovers a set of frequent label paths satisfying the constraint mechanism.
- Dependent claims (7, 8, 9):
  - 7. The computer program product according to claim 6, wherein the schema discovery system acquires the set of related documents from the World Wide Web.
  - 8. The computer program product according to claim 6, wherein the set of related documents includes XML documents.
  - 9. The computer program product according to claim 8, wherein the majority schema includes a majority XML DTD schema.
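Claims 5 and 9 mention expressing the result as an XML DTD schema. One way this could look, sketched in Python under stated assumptions: labels are already valid XML names, every child is declared optional, leaves hold text, and sibling order is reconstructed alphabetically because a plain set of paths does not retain the original ordering. The function name `paths_to_dtd` is hypothetical.

```python
def paths_to_dtd(paths):
    """Render a set of label paths as DTD element declarations.

    Each label is declared once; children of a label are merged
    across all paths in which that label appears as a parent.
    """
    children = {}  # label -> ordered list of distinct child labels
    for path in sorted(paths, key=lambda p: (len(p), p)):
        children.setdefault(path[-1], [])
        if len(path) > 1:
            kids = children.setdefault(path[-2], [])
            if path[-1] not in kids:
                kids.append(path[-1])
    lines = []
    for label, kids in children.items():
        if kids:
            # Assumption: every child is optional in the content model.
            model = ", ".join(kid + "?" for kid in kids)
            lines.append(f"<!ELEMENT {label} ({model})>")
        else:
            lines.append(f"<!ELEMENT {label} (#PCDATA)>")
    return "\n".join(lines)

# Made-up frequent label paths for a set of resume documents.
frequent = {
    ("resume",),
    ("resume", "education"),
    ("resume", "experience"),
    ("resume", "education", "degree"),
}
dtd = paths_to_dtd(frequent)
print(dtd)
```

Declaring each label only once keeps the output a legal DTD even when the same keyword occurs under several parents, at the cost of merging their content models.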

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Sundaresan, Neelakantan; Chung, Christina Yip
Primary Examiner(s): Popovici, Dov
Assistant Examiner(s): Mahmoudi, Hassan
Application Number: US09/628,097
Time in Patent Office: 1,104 Days
Field of Search: 707/3, 707/513, 707/100, 707/10, 707/4, 707/5, 707/6, 707/104.1, 707/205, 707/2, 717/104, 705/14, 705/10, 709/315, 709/223, 709/220, 713/176
US Class Current: 1/1
CPC Class Codes:
G06F 16/81   Indexing, e.g. XML tags; Da...
G06F 16/951   Indexing; Web crawling tech...
G06F 16/986   Document structures and sto...
Y10S 707/99933   Query processing, i.e. sear...
