APPARATUS, SYSTEM, AND METHOD FOR EFFICIENT CONTENT INDEXING OF STREAMING XML DOCUMENT CONTENT
First Claim
1. A method for content indexing of streaming hierarchical document content, the method comprising:
generating a hierarchical pattern forest from a set of structured index path expressions, the hierarchical pattern forest comprising at least one of a tree and a twig generated from one or more structured index path expressions uniquely associated with a namespace indicator for a hierarchical node from a streaming hierarchical document;
comparing the hierarchical node to nodes of the hierarchical pattern forest;
matching the hierarchical node with an index node in one of a tree and a twig of the hierarchical pattern forest, the index node having a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions for the namespace indicator; and
storing an index entry for the hierarchical node in response to the determined match, the index entry comprising one or more of a hierarchical document identifier, a hierarchical node name, a namespace indicator for the hierarchical node, and hierarchical node content.
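The compare, match, and store steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the pattern forest is reduced to one nested dictionary per namespace indicator, only child-axis steps are handled, and the names `IndexEntry`, `build_tree`, `is_index_node`, and `index_node` are hypothetical.

```python
from dataclasses import dataclass

_INDEX = "$index"  # marker key; '$' cannot start an XML name, so no clash


@dataclass(frozen=True)
class IndexEntry:
    """The four fields named in the claim (assumed to be plain strings)."""
    doc_id: str
    node_name: str
    namespace: str
    content: str


def build_tree(path_expressions):
    """Merge child-axis paths such as '/order/item/sku' into a nested dict."""
    root = {}
    for expr in path_expressions:
        node = root
        for step in expr.strip("/").split("/"):
            node = node.setdefault(step, {})
        node[_INDEX] = True  # the last step is an index node
    return root


def is_index_node(tree, ancestor_path):
    """Walk from the outermost ancestor down to the node being tested."""
    node = tree
    for name in ancestor_path:
        node = node.get(name)
        if node is None:
            return False
    return node.get(_INDEX, False)


def index_node(store, forests, doc_id, namespace, path, content):
    """Append an entry to `store` only when the path matches for this namespace."""
    tree = forests.get(namespace)
    if tree is not None and is_index_node(tree, path):
        store.append(IndexEntry(doc_id, path[-1], namespace, content))
        return True
    return False
```

Keying the forests by namespace indicator mirrors the claim's requirement that path expressions be associated with a namespace; a node from an unknown namespace is never indexed.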
Abstract
An apparatus, system, and method are disclosed for efficient content indexing of streaming XML document content. A forest generator generates an XML pattern forest from a set of structured index path expressions. The XML pattern forest includes trees and twigs generated from structured index path expressions uniquely associated with a namespace indicator for an XML node. The XML node is identified in a stream of at least one XML document. A comparison module compares the XML node to nodes of trees and twigs of the XML pattern forest. A determination module determines a match between the XML node and an index node in one of a tree and a twig of the XML pattern forest. The index node has a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions. A storage module stores an index entry for the XML node in response to the determined match. The index entry includes an XML document identifier, an XML node name, a namespace indicator for the XML node, and XML node content.
57 Citations
20 Claims
1. A method for content indexing of streaming hierarchical document content, the method comprising:
generating a hierarchical pattern forest from a set of structured index path expressions, the hierarchical pattern forest comprising at least one of a tree and a twig generated from one or more structured index path expressions uniquely associated with a namespace indicator for a hierarchical node from a streaming hierarchical document;
comparing the hierarchical node to nodes of the hierarchical pattern forest;
matching the hierarchical node with an index node in one of a tree and a twig of the hierarchical pattern forest, the index node having a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions for the namespace indicator; and
storing an index entry for the hierarchical node in response to the determined match, the index entry comprising one or more of a hierarchical document identifier, a hierarchical node name, a namespace indicator for the hierarchical node, and hierarchical node content.
Dependent claims: 2, 3, 4, 5, 6, 7.
8. A computer program product comprising a computer readable storage medium having computer usable program code executable to perform operations for content indexing of streaming Extensible Markup Language (XML) document content, the computer program product comprising:
a scanner configured to scan a streaming XML document in a stream of XML documents, the streaming XML document streamed in document order according to XML tree traversal protocol;
an identification module configured to identify an XML node of the streaming XML document, the XML node comprising one of an XML document element node and an XML document element attribute node;
a forest generator configured to generate an XML pattern forest from a set of structured index path expressions, the XML pattern forest comprising at least one of a tree and a twig generated from one or more structured index path expressions uniquely associated with a namespace indicator for the XML node;
a comparison module configured to compare the XML node to nodes of the XML pattern forest;
a determination module configured to determine a match between the XML node and an index node in one of a tree and a twig of the XML pattern forest, the index node having a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions for the namespace indicator; and
a storage module configured to store an index entry for the XML node in response to the determined match, the index entry comprising an XML document identifier, an XML node name, a namespace indicator for the XML node, and XML node content.
Dependent claims: 9, 10, 11, 12, 13.
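The scanner and identification module can be pictured with a short Python sketch built on the standard library's `xml.etree.ElementTree.iterparse`. This is an illustration under assumptions, not the claimed scanner: the function names `split_tag` and `scan`, the tuple layout, and the `@` prefix for attribute steps are all hypothetical. Element nodes are reported on their start event so that output follows document order; element text is left out because it is only guaranteed to be complete at the matching end event.

```python
import io
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def scan(stream):
    """Yield (kind, namespace, path, value) for element and attribute nodes."""
    path = []  # names of the open elements, outermost first
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            ns, name = split_tag(elem.tag)
            path.append(name)
            yield ("element", ns, tuple(path), None)
            # Attributes are already available on the start event.
            for attr, value in elem.attrib.items():
                a_ns, a_name = split_tag(attr)
                yield ("attribute", a_ns, tuple(path) + ("@" + a_name,), value)
        else:
            path.pop()
            elem.clear()  # release children once the element is closed


doc = b'<o:order xmlns:o="urn:orders" id="7"><o:item sku="A-100"/></o:order>'
nodes = list(scan(io.BytesIO(doc)))
```

Each yielded tuple carries the namespace URI and the ancestor path, which is the information a comparison step would need to walk a pattern tree.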
14. A system for content indexing of streaming Extensible Markup Language (XML) document content, the system comprising:
a storage services module executed by a processor within a computer readable storage medium, the content storage module configured to store documents in a storage repository and to retrieve documents from the storage repository in response to XML document requests;
a storage engine in communication with the storage services module, the storage engine configured to store XML documents in the storage repository intact and to generate a stream of XML documents, the stream of XML documents comprising new XML documents that are added to the storage repository and updates to existing XML documents stored in the storage repository;
a document indexer configured to receive the stream of XML documents and direct the stream of XML documents to a content indexer comprising:
a scanner configured to scan a streaming XML document in a stream of XML documents, the streaming XML document streamed in document order according to XML tree traversal protocol;
an identification module configured to identify an XML node of the streaming XML document, the XML node comprising one of an XML document element node and an XML document element attribute node;
a forest generator configured to generate an XML pattern forest from a set of structured index path expressions retrieved from an index path expression repository, the XML pattern forest comprising at least one of a tree and a twig generated from one or more structured index path expressions uniquely associated with a namespace indicator for the XML node;
a comparison module configured to compare the XML node to nodes of the XML pattern forest;
a determination module configured to determine a match between the XML node and an index node in one of a tree and a twig of the XML pattern forest, the index node having a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions for the namespace indicator; and
a storage module configured to store an index entry for the XML node in an index store in response to the determined match, the index entry comprising an XML document identifier, an XML node name, a namespace indicator for the XML node, and XML node content; and
a query services module configured to receive index queries for XML documents containing content satisfying an index query and configured to return index entries that satisfy the index query.
Dependent claims: 15, 16, 17, 18, 19.
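How an index store and a query services module might fit together can be sketched in Python as follows. Everything here is an assumption made for illustration: the claim does not specify a storage layout, so the in-memory dictionary keyed by (namespace, node name), the class `IndexStore`, and the `remove_document` helper for re-indexing an updated document are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class IndexEntry(NamedTuple):
    doc_id: str
    node_name: str
    namespace: str
    content: str


class IndexStore:
    """In-memory stand-in for the index store plus a simple query service."""

    def __init__(self):
        self._by_key = defaultdict(list)  # (namespace, node_name) -> entries

    def store(self, entry):
        self._by_key[(entry.namespace, entry.node_name)].append(entry)

    def remove_document(self, doc_id):
        """Drop a document's entries, e.g. before indexing its updated version."""
        for key in list(self._by_key):
            kept = [e for e in self._by_key[key] if e.doc_id != doc_id]
            if kept:
                self._by_key[key] = kept
            else:
                del self._by_key[key]

    def query(self, namespace, node_name, content=None):
        """Return entries for the node, optionally filtered by exact content."""
        hits = self._by_key.get((namespace, node_name), [])
        return [e for e in hits if content is None or e.content == content]
```

Because each entry carries its document identifier, a caller can turn the returned entries into a list of matching documents and fetch them from the storage repository.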
20. A computer program product comprising a computer readable storage medium having computer usable program code executable to perform operations for content indexing of streaming Extensible Markup Language (XML) document content, the computer program product comprising:
a scanner configured to scan a streaming XML document in a stream of XML documents, the streaming XML document streamed in document order according to XML tree traversal protocol;
an identification module configured to identify an XML node of the streaming XML document, the XML node comprising one of an XML document element node and an XML document element attribute node;
a forest generator configured to generate an XML pattern forest from a set of structured index path expressions, the XML pattern forest comprising at least one of a tree and a twig generated from one or more structured index path expressions uniquely associated with a namespace indicator for the XML node, the forest generator comprising:
a read module configured to read the set of structured index path expressions from a repository, the set of structured index path expressions identified by way of the namespace indicator;
a skip module configured to ignore each index path expression having no descendant axis steps, and to ignore each index path expression having a first axis step that is different from a root node of the streaming XML document;
a twig generator configured to define a new twig for each index path expression having a descendant-or-self axis for a first step, the new twig comprising nodes representing the index path expression;
a tree generator configured to define a new tree for each index path expression having a first axis step different from a root of an existing tree in the XML pattern forest, the new tree comprising nodes representing the index path expression; and
a grafting module configured to identify an existing tree in the forest having a root node matching the first axis step of an index path expression and appending a branch of nodes to the existing tree, the branch of nodes corresponding to one or more axis steps of the index path expression that differ from nodes of the existing tree, the branch anchored at a node that matches a last matching axis step of the index path expression evaluated from left to right;
a comparison module configured to compare the XML node to nodes of the XML pattern forest and configured to determine that the namespace indicator is different from a previously identified namespace indicator and cause the forest generator to reference a second set of structured index path expressions, the second set of structured index path expressions identified by way of the namespace indicator, the forest generator further configured to modify the XML pattern forest to include at least one of a tree and a twig representative of the second set of structured index path expressions and the set of structured index path expressions;
a determination module configured to determine a match between the XML node and an index node in one of a tree and a twig of the XML pattern forest, the index node having a path from an ancestor node to the index node that matches the axis steps of at least one of the structured index path expressions of the namespace indicator; and
a storage module configured to store an index entry for the XML node in response to the determined match, the index entry comprising an XML document identifier, an XML node name, a namespace indicator for the XML node, and XML node content.
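The twig generator, tree generator, and grafting module of claim 20 can be approximated by a prefix-tree insertion. The Python sketch below is one possible reading, not the patented code. Its assumptions: expressions are simple name steps; a leading `//` marks a descendant-or-self first step; interior `//` steps are rejected; and only the second skip condition (first step differs from the document root) is modelled. The names `PatternNode`, `PatternForest`, `_chain`, and `add_expression` are hypothetical.

```python
class PatternNode:
    def __init__(self, name):
        self.name = name
        self.children = {}     # step name -> PatternNode
        self.is_index = False  # True when a path expression ends here


class PatternForest:
    def __init__(self):
        self.trees = {}  # root step name -> root PatternNode
        self.twigs = []  # one chain per '//' expression


def _chain(steps):
    """Build a linear chain of nodes for the given steps; return its root."""
    root = node = PatternNode(steps[0])
    for step in steps[1:]:
        node.children[step] = PatternNode(step)
        node = node.children[step]
    node.is_index = True
    return root


def add_expression(forest, expr, doc_root):
    is_twig = expr.startswith("//")
    body = expr[2:] if is_twig else expr.lstrip("/")
    if "//" in body:
        raise ValueError("interior '//' steps are outside this sketch")
    steps = [s for s in body.split("/") if s]
    if not steps:
        return
    if is_twig:                   # twig generator: new twig per expression
        forest.twigs.append(_chain(steps))
        return
    if steps[0] != doc_root:      # skip: cannot match this document
        return
    root = forest.trees.get(steps[0])
    if root is None:              # tree generator: no tree with this root yet
        forest.trees[steps[0]] = _chain(steps)
        return
    node = root                   # grafting: reuse the shared prefix, then
    for step in steps[1:]:        # branch at the last matching step
        node = node.children.setdefault(step, PatternNode(step))
    node.is_index = True
```

Sharing prefixes this way means that two expressions such as `/order/item/sku` and `/order/item/price` are checked with a single walk down `order` and `item` instead of two independent comparisons.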
Specification