Single pass workload directed clustering of XML documents
First Claim
1. A system for clustering XML documents, the system comprising:
- an arrangement for parsing an XML document by node;
an arrangement for initializing at least one parsed node;
an arrangement for partitioning at least one parsed node; and
an arrangement for processing at least one parsed node;
wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;
wherein the system utilizes a processor to cluster XML documents;
wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval;
wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes;
wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and
wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and system for clustering of XML documents is disclosed. The method operates under specified memory-use constraints. The system implements the method and scans an XML document, assigns edge-weights according to the application workload, and maps clusters of XML nodes to disk pages, all in a single parser-controlled pass over the XML data. Application workload information is used to generate XML clustering solutions that lead to substantial reduction in page faults for the workload under consideration. Several approaches for representing workload information are disclosed. For example, the workload may list the XPath operators invoked during the application along with their invocation frequencies. The application workload can be further refined by incorporating additional features such as query importance or query compilation costs. XML access patterns could be also modeled using stochastic approaches.
-
Citations
30 Claims
-
1. A system for clustering XML documents, the system comprising:
-
an arrangement for parsing an XML document by node; an arrangement for initializing at least one parsed node; an arrangement for partitioning at least one parsed node; and an arrangement for processing at least one parsed node; wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed; wherein the system utilizes a processor to cluster XML documents; wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval; wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes; wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A system for clustering XML documents, the system comprising:
-
an arrangement for parsing an XML document by node; an arrangement for determining XPath work traversals of at least one parsed node; an arrangement for clustering at least one parsed node; and an arrangement for assigning at least one cluster to a page; wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed; wherein the system utilizes a processor to cluster XML documents; wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval; wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes; wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A system for clustering an XML document having at least one node, the system comprising:
-
an arrangement for assigning an edge weight; an arrangement for tree partitioning; and an arrangement for page assignment;
wherein the system removes XML text data of a node prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed;wherein the system utilizes a processor to cluster an XML document; wherein the system partitions a weight range into equal size weight intervals and associates only one partition for each weight interval; wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes; wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and wherein the XML clustering system processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass. - View Dependent Claims (20, 21)
-
-
22. A method for clustering XML documents, the method comprising the steps of:
-
parsing an XML document by node; initializing at least one parsed node; partitioning at least one parsed node, wherein a weight range is partitioned into equal size weight intervals and only one partition is associated with each weight interval; processing at least one parsed node; and wherein XML text data of a node is removed prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed; wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water mark operation resumes; wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and wherein the method for clustering XML documents processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29)
-
-
30. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for clustering XML documents, said method comprising the steps of:
-
parsing an XML document by node; initializing at least one parsed node; partitioning at least one parsed node, wherein a weight range is partitioned into equal size weight intervals and only one partition is associated with each weight interval;
processing at least one parsed node; andwherein XML text data of a node is removed prior to the entire document being clustered by detecting a ready cluster and removing the ready cluster from an intermediate partition upon assignment to a page, wherein said ready cluster is a cluster which carries with it corresponding XML text that would be part of a final partition while avoiding the need to keep the entire XML document in memory until the final partition is computed; wherein given a predetermined memory limit M for managing memory usage in selecting optimal partitions, when memory usage reaches a high water mark, a corrective action is triggered to select a ready sub-partition, and when memory usage reaches a low water operation resumes; wherein said ready sub-partition is a highest value partition associated with a root of a processed subtree which is a subset of a computed best partition for a whole clustering tree; and wherein the method for clustering XML documents processes the XML document, partitions the XML document into clusters, and assigns the clusters to pages all within a single pass.
-
Specification