Single pass workload directed clustering of XML documents

US 7,512,615 B2
Filed: 11/07/2003
Issued: 03/31/2009
Est. Priority Date: 11/07/2003
Status: Expired due to Fees

Abstract

A method and system for clustering XML documents are disclosed. The method operates under specified memory-use constraints. The system implements the method: it scans an XML document, assigns edge weights according to the application workload, and maps clusters of XML nodes to disk pages, all in a single parser-controlled pass over the XML data. Application workload information is used to generate XML clustering solutions that lead to a substantial reduction in page faults for the workload under consideration. Several approaches for representing workload information are disclosed. For example, the workload may list the XPath operators invoked during the application along with their invocation frequencies. The application workload can be further refined by incorporating additional features such as query importance or query compilation costs. XML access patterns could also be modeled using stochastic approaches.
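
The workload representation mentioned in the abstract (XPath operators with invocation frequencies) can be illustrated with a minimal sketch. It assumes a much narrower workload than the patent contemplates: each query is a plain child-axis path such as `/lib/book/title`, with no predicates, wildcards, or other axes. The function name `edge_weights` and the sample workload are hypothetical and do not come from the patent.

```python
from collections import Counter

def edge_weights(workload):
    """Sum query frequencies over every parent/child edge a path traverses.

    workload: iterable of (path, frequency) pairs, where path is a simple
    child-axis path like "/a/b/c" (an assumption of this sketch).
    Returns a Counter keyed by (parent_path, child_path).
    """
    weights = Counter()
    for path, freq in workload:
        steps = [s for s in path.split("/") if s]
        for i in range(1, len(steps)):
            parent = "/" + "/".join(steps[:i])
            child = parent + "/" + steps[i]
            # Each invocation of the query crosses this edge once.
            weights[(parent, child)] += freq
    return weights

# Hypothetical workload: three paths with their invocation frequencies.
workload = [("/lib/book/title", 10), ("/lib/book/author", 3), ("/lib/journal", 1)]
w = edge_weights(workload)
```

Edges crossed by frequent queries end up with larger weights, which is the signal a clustering step could use to keep the two endpoints on the same page.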

30 Claims

1. A system for clustering XML documents, the system comprising:
- an arrangement for parsing an XML document by node;
  
  an arrangement for initializing at least one parsed node;
  
  an arrangement for partitioning at least one parsed node; and
  
  an arrangement for processing at least one parsed node;
  
  wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
  
  wherein the system utilizes a processor to cluster XML documents;
  
  wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval;
  
  wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
  
  wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
  
  wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
- - 2. The system according to claim 1, wherein the arrangement for initializing at least one parsed node comprises:
    - an arrangement for creating at least one tree node for at least one parsed node;
      
      an arrangement for providing XML workload information about at least one parsed node;
      
      an arrangement for providing at least one parent/child link and assigning an edge weight when a parsed node is a parent;
      
      an arrangement for designating a parsed node as the root of a tree when a parsed node is not a parent;
      
      an arrangement for creating a partition; and
      
      an arrangement for adding the created partition to a list of created partitions.
  - 3. The system according to claim 2, wherein the arrangement for providing XML workload information comprises:
    - an arrangement for analyzing at least one XPath query when at least one node is parsed;
      
      an arrangement for identifying at least one node that is visited by an XPath query; and
      
      an arrangement for determining the number of visits to at least one node via the parent/child edge.
  - 4. The system according to claim 2, wherein the arrangement for creating a partition creates an initial partition with node weight w′ and value 0.
  - 5. The system according to claim 1, wherein the arrangement for partitioning at least one parsed node comprises:
    - an arrangement for partitioning at least one parsed node by making use of a partition of a root node and partitions of at least one child node;
      
      an arrangement for creating Wc=W/C partitions, wherein W is a weight bound and further wherein C is a Chunk size;
      
      an arrangement for selecting a partition with a maximum value from among the Wc partitions; and
      
      an arrangement for deleting any child nodes of the at least one parsed node.
  - 6. The system according to claim 5, wherein each of the Wc partitions has maximum value among all partitions with the same weight.
  - 7. The system according to claim 2, wherein the nodes connected by heavier edges are mapped to the same page.
  - 8. The system according to claim 2, wherein system memory is constrained.
  - 9. The system according to claim 2, further comprising an arrangement for identifying at least one ready cluster.
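
The weight-interval limitation in claims 1, 5, and 6 (a weight range split into equal-size intervals, one partition kept per interval, Wc = W/C intervals in total) can be pictured with the following sketch. It assumes positive integer weights, a bound W divisible by the chunk size C, and candidate partitions reduced to `(weight, value)` pairs; the name `prune_by_interval` is hypothetical, and the claims do not prescribe this data layout.

```python
def prune_by_interval(partitions, weight_bound, chunk):
    """Keep only the highest-value candidate in each weight interval.

    partitions: iterable of (weight, value) pairs with positive integer weights.
    Intervals are (0, C], (C, 2C], ..., (W - C, W], so there are W / C of them.
    Returns a dict mapping interval index -> (weight, value).
    """
    best = {}
    for weight, value in partitions:
        if weight > weight_bound:
            continue  # exceeds the weight bound W, cannot fit
        idx = (weight - 1) // chunk  # interval index for integer weights >= 1
        if idx not in best or value > best[idx][1]:
            best[idx] = (weight, value)
    return best

# Hypothetical candidates; W = 8 and C = 2 give at most 4 surviving partitions.
candidates = [(1, 5), (2, 7), (3, 1), (4, 4), (8, 9), (9, 100)]
kept = prune_by_interval(candidates, weight_bound=8, chunk=2)
```

Capping the table at W/C entries trades some precision for a bounded amount of per-node state, which fits the memory-constrained setting the claims describe.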

10. A system for clustering XML documents, the system comprising:
- an arrangement for parsing an XML document by node;
  
  an arrangement for determining XPath work traversals of at least one parsed node;
  
  an arrangement for clustering at least one parsed node; and
  
  an arrangement for assigning at least one cluster to a page;
  
  wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
  
  wherein the system utilizes a processor to cluster XML documents;
  
  wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval;
  
  wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
  
  wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
  
  wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
- - 11. The system according to claim 10, wherein the arrangement for parsing at least one parsed node comprises:
    - an arrangement for creating at least one tree node for at least one parsed node;
      
      an arrangement for providing XML workload information about at least one parsed node;
      
      an arrangement for providing at least one parent/child link and assigning an edge weight when a parsed node is a parent;
      
      an arrangement for designating a parsed node as the root of a tree when a parsed node is not a parent;
      
      an arrangement for creating a partition; and
      
      an arrangement for adding the created partition to a list of created partitions.
  - 12. The system according to claim 11, wherein the arrangement for providing XML workload information comprises:
    - an arrangement for analyzing at least one XPath query when at least one node is parsed;
      
      an arrangement for identifying at least one node that is visited by an XPath query; and
      
      an arrangement for determining the number of visits to at least one node via the parent/child edge.
  - 13. The system according to claim 11, wherein the arrangement for creating a partition creates an initial partition with node weight w′ and value 0.
  - 14. The system according to claim 10, wherein the arrangement for assigning at least one cluster to a page comprises:
    - an arrangement for partitioning at least one parsed node by making use of a partition of a root node and partitions of at least one child node;
      
      an arrangement for creating Wc=W/C partitions, wherein W is a weight bound and further wherein C is a Chunk size;
      
      an arrangement for selecting a partition with a maximum value from among the Wc partitions; and
      
      an arrangement for deleting any child nodes of the at least one parsed node.
  - 15. The system according to claim 14, wherein each of the Wc partitions has maximum value among all partitions with the same weight.
  - 16. The system according to claim 11, wherein the nodes connected by heavier edges are mapped to the same page.
  - 17. The system according to claim 10, wherein system memory is constrained.
  - 18. The system according to claim 10, further comprising an arrangement for identifying at least one ready cluster.
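
The high/low water mark behavior recited in the independent claims amounts to hysteresis: corrective action starts when memory usage reaches the high mark and normal operation resumes only once usage has fallen to the low mark. The sketch below shows only that switching logic. The class name `WatermarkController`, the integer thresholds, and the sample usage trace are assumptions; how usage is measured and how a ready sub-partition is selected are outside this sketch.

```python
class WatermarkController:
    """Tracks whether corrective action (flushing) is currently active."""

    def __init__(self, high, low):
        if low >= high:
            raise ValueError("low water mark must be below high water mark")
        self.high = high
        self.low = low
        self.flushing = False

    def update(self, usage):
        """Record the current memory usage and return True while flushing."""
        if not self.flushing and usage >= self.high:
            self.flushing = True   # high mark reached: start corrective action
        elif self.flushing and usage <= self.low:
            self.flushing = False  # low mark reached: resume normal operation
        return self.flushing

# Hypothetical usage trace against marks chosen below a memory limit M = 100.
controller = WatermarkController(high=90, low=60)
states = [controller.update(u) for u in [50, 89, 90, 80, 61, 60, 70]]
```

Using two thresholds instead of one avoids toggling between flushing and normal processing on every small change in usage.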

19. A system for clustering an XML document having at least one node, the system comprising:
- an arrangement for assigning an edge weight;
  
  an arrangement for tree partitioning; and
  
  an arrangement for page assignment;
  
  wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
  
  wherein the system utilizes a processor to cluster an XML document;
  
  wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval;
  
  wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
  
  wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
  
  wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
- - 20. The system according to claim 19, wherein system memory is constrained.
  - 21. The system according to claim 19, further comprising an arrangement for identification of at least one ready cluster.
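
The single-pass property and the early removal of XML text can be pictured with a streaming parser. The sketch below is a deliberately simplified stand-in, not the claimed method: it fills pages greedily in document-completion order and ignores edge weights entirely. It only shows that nodes can be assigned to a page and their text released before the rest of the document has been read. The function name `stream_pages` and the fixed per-page node count are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def stream_pages(xml_bytes, page_capacity):
    """Assign completed nodes to pages in one pass, releasing text as we go.

    Greedy placeholder policy: close a page as soon as it holds
    page_capacity nodes. Returns a list of pages, each a list of
    (tag, text) pairs.
    """
    pages, current = [], []
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        current.append((elem.tag, (elem.text or "").strip()))
        # The node is on a page now, so its text and children can be dropped.
        elem.clear()
        if len(current) >= page_capacity:
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages

pages = stream_pages(b"<a><b>x</b><c>y</c></a>", page_capacity=2)
```

A workload-directed version would replace the greedy fill with a decision based on edge weights, while keeping the same pattern of emitting a cluster and clearing its text once it is ready.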

22. A method for clustering XML documents, the method comprising the steps of:
- parsing an XML document by node;
  
  initializing at least one parsed node;
  
  partitioning at least one parsed node, wherein a weight range is partitioned into equal size weight intervals and only one partition is associated with each weight interval;
  
  processing at least one parsed node; and
  
  wherein XML text data of a node is removed prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
  
  wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
  
  wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
  
  wherein the method for clustering XML documents processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
- - 23. The method according to claim 22, wherein initializing the at least one parsed node comprises:
    - creating at least one tree node for at least one parsed node;
      
      providing XML workload information about at least one parsed node;
      
      providing at least one parent/child link and assigning an edge weight when a parsed node is a parent;
      
      designating a parsed node as the root of a tree when a parsed node is not a parent;
      
      creating a partition; and
      
      adding the created partition to a list of created partitions.
  - 24. The method according to claim 23, wherein providing XML workload information comprises:
    - analyzing at least one XPath query when at least one node is parsed;
      
      identifying at least one node that is visited by an XPath query; and
      
      determining the number of visits to at least one node via the parent/child edge.
  - 25. The method according to claim 22, wherein the initial partition is created with node weight w′ and value 0.
  - 26. The method according to claim 22, wherein partitioning at least one parsed node comprises:
    - partitioning at least one parsed node by making use of a partition of a root node and partitions of at least one child node;
      
      creating Wc=W/C partitions, wherein W is a weight bound and further wherein C is a Chunk size;
      
      selecting a partition with a maximum value from among the Wc partitions; and
      
      deleting any child nodes of the at least one parsed node.
  - 27. The method according to claim 26, wherein each of the Wc partitions has maximum value among all partitions with the same weight.
  - 28. The method according to claim 23, wherein the nodes connected by heavier edges are mapped to the same page.
  - 29. The system according to claim 22, further comprising identifying at least one ready cluster.
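
Claims 23 through 27 describe starting each node with a partition of its own weight and value 0, then combining a root's partitions with those of its children and keeping the maximum-value result. The claims do not spell out a recurrence, so the sketch below substitutes a classic bottom-up tree-partitioning dynamic program as an assumption: for each child edge, either cut it (the child subtree keeps its best partition) or keep it inside the parent's cluster (weights add, the edge weight is gained). Chunking is omitted (C = 1), and the function name and sample tree are hypothetical.

```python
def best_partition_value(tree, node_weight, edge_weight, root, bound):
    """Maximum total weight of edges kept inside clusters of weight <= bound.

    tree: dict mapping a node to its list of children.
    Assumes every single node weight is <= bound.
    """
    def solve(v):
        # Maps weight of the cluster containing v -> best value for subtree(v).
        table = {node_weight[v]: 0}  # initial partition: own weight, value 0
        for c in tree.get(v, []):
            child = solve(c)
            child_best = max(child.values())
            merged = {}
            for w1, v1 in table.items():
                # Option 1: cut edge (v, c); the child keeps its best partition.
                merged[w1] = max(merged.get(w1, float("-inf")), v1 + child_best)
                # Option 2: keep edge (v, c) inside v's cluster.
                for w2, v2 in child.items():
                    w = w1 + w2
                    if w <= bound:
                        cand = v1 + v2 + edge_weight[(v, c)]
                        merged[w] = max(merged.get(w, float("-inf")), cand)
            table = merged  # the child's table is no longer needed
        return table

    return max(solve(root).values())

# Hypothetical tree: r has children a and b; a has child c; unit node weights.
tree = {"r": ["a", "b"], "a": ["c"]}
node_weight = {"r": 1, "a": 1, "b": 1, "c": 1}
edge_weight = {("r", "a"): 5, ("r", "b"): 2, ("a", "c"): 4}
value_small = best_partition_value(tree, node_weight, edge_weight, "r", bound=2)
value_large = best_partition_value(tree, node_weight, edge_weight, "r", bound=4)
```

With a bound of 2 the best choice keeps edges (a, c) and (r, b) rather than the single heaviest edge (r, a), which shows why a per-node table of alternatives is kept instead of a greedy pick.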

30. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for clustering XML documents, said method comprising the steps of:
- parsing an XML document by node;
  
  initializing at least one parsed node;
  
  partitioning at least one parsed node, wherein a weight range is partitioned into equal size weight intervals and only one partition is associated with each weight interval;
  
  processing at least one parsed node; and
  
  wherein XML text data of a node is removed prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
  
  wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
  
  wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
  
  wherein the method for clustering XML documents processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Shmueli, Oded; Padmanabhan, Sriram K.; Bordawekar, Rajesh
Primary Examiner(s): Mofiz; Apu M
Assistant Examiner(s): Stace; Brent
Application Number: US10/703,250
Publication Number: US 20050102256A1
Time in Patent Office: 1,971 Days
Field of Search: 707/1, 707/3, 707/6, 707/101
US Class Current: 1/1
CPC Class Codes: G06F 16/83 Querying; Y10S 707/99942 Manipulating data structure...
