System and method for discovering schematic structure in hypertext documents
First Claim
1. A method of reformatting at least one mark-up language document comprising the steps:
- discovering a schematic structure from said document and creating a reformatted document with a schema corresponding to said schematic structure, said steps of discovering and reformatting comprising;
receiving a document having a plurality of semantic nodes and a plurality of formatting nodes;
tokenizing said plurality of said semantic and formatting nodes;
identifying a set of keyword nodes, said set of keyword nodes comprising at least one of said plurality of semantic nodes which match an element in a set of keywords and corresponding synonyms;
labeling each element of said set of keyword nodes with a label corresponding to a matching keyword;
arranging said plurality of said semantic and formatting nodes so that each node representing an object of a similar level of abstraction are arranged as sibling nodes;
identifying a non-keyword set of nodes, said non-keyword set comprising all of said plurality of semantic and formatting nodes not an element of said set of keyword nodes;
determining for each node in said non-keyword set a corresponding first child node which is also an element of said set of keyword nodes, said corresponding first child node having a keyword label, and reformatting said document by generating a document with a schema having a keyword structure based upon re-labeling each node in said non-keyword set with a keyword label of said corresponding first child node.
3 Assignments
0 Petitions
Accused Products
Abstract
An underlying schema is discovered for an HTML document or a set of HTML documents, authored in various styles. Prior domain knowledge regarding punctuation, keywords, synonyms and HTML tags is used to 1) break a document up into separate objects, 2) identify the objects corresponding to keywords, 3) regroup objects: into hierarchical layers of abstraction, 4) logically order objects at the same level of abstraction, and 5) finally remove any non-keyword related information from the document'"'"'s discovered schematic structure. The discovered schema supports structural queries from search engines that locate data that are more semantically related to the requested information than data located by simple keyword searching.
90 Citations
23 Claims
-
1. A method of reformatting at least one mark-up language document comprising the steps:
-
discovering a schematic structure from said document and creating a reformatted document with a schema corresponding to said schematic structure, said steps of discovering and reformatting comprising;
receiving a document having a plurality of semantic nodes and a plurality of formatting nodes;
tokenizing said plurality of said semantic and formatting nodes;
identifying a set of keyword nodes, said set of keyword nodes comprising at least one of said plurality of semantic nodes which match an element in a set of keywords and corresponding synonyms;
labeling each element of said set of keyword nodes with a label corresponding to a matching keyword;
arranging said plurality of said semantic and formatting nodes so that each node representing an object of a similar level of abstraction are arranged as sibling nodes;
identifying a non-keyword set of nodes, said non-keyword set comprising all of said plurality of semantic and formatting nodes not an element of said set of keyword nodes;
determining for each node in said non-keyword set a corresponding first child node which is also an element of said set of keyword nodes, said corresponding first child node having a keyword label, and reformatting said document by generating a document with a schema having a keyword structure based upon re-labeling each node in said non-keyword set with a keyword label of said corresponding first child node.
-
-
2. A method of reformatting at least one mark-up language document, as per claim 1, wherein said at least one mark-up language document is written in HTML.
-
3. A method of reformatting at least one mark-up language document, as per claim 1, wherein said schema is expressed in XML.
-
4. A method of reformatting at least one mark-up language document, as per claim 1, wherein said method is implemented locally or remotely on one or more computer-based systems.
-
5. A method of reformatting at least one mark-up language document, as per claim 1, wherein said method is implemented across networks comprising any of LANs, WANs, cellular, Internet or Web-based networks.
-
6. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes comprising the steps:
-
tokenizing said plurality of labeled nodes, wherein each of a first plurality of text nodes in said tree are separated into a plurality of tokens grouped under a specified label;
identifying a second plurality of text nodes which each reference an associated keyword;
labeling each of said second plurality of text nodes with a label corresponding to said associated keyword;
creating a plurality of groups of nodes, each group comprising those nodes of said second plurality of text nodes which have a same immediate parent node;
recursively applying, to said plurality of labeled nodes of said ordered tree, a heuristic comprising the following steps;
identifying each of said plurality of labeled nodes which have child nodes, pi, labeled with a phrase tag, and for each pi identified, re-arranging a set of nodes, said re-arranging resulting in said set of nodes becoming child nodes of pi, wherein said set of nodes comprises all nodes from a next sibling of pi either
1) a next node with a label the same as pi'"'"'s label or
2) a last sibling of pi if said next node does not exist;
identifying a set of nodes, each element of said set having a label corresponding to a header tag;
re-ordering said plurality of labeled nodes so that there are no two nodes with a same label along a same path from a root of said ordered tree;
identifying a non-keyword set of nodes, said non-keyword set comprising all of said plurality of labeled nodes which are not labeled with a keyword;
identifying for each element of said non-keyword set a corresponding fist child node which is labeled with a keyword, replacing each element of said non-keyword set with said corresponding first child node, with the constraint that if at least two nodes
1) have a same label,
2) each have, before performing said step of replacing, a parent node having a same label or
3) a parent having a list tag, then maintaining a sibling relationship between them.
-
-
7. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 6, wherein said keyword comprises a base word and at least one synonym of said base word.
-
8. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 6, wherein at least one punctuation mark is used as a delimiter to identify each of said plurality of tokens.
-
9. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 8, wherein said at least one punctuation mark is at least one of “
- ,”
, “
;
”
, or “
;
”
.
- ,”
-
10. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 6, wherein said document is written in HTML.
-
11. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 10, wherein said phrase tag is at least one of h1, h2, h3, h4, h5, h6, div, p, tr, td, dt, dd, li, title, u, strong, b, em, or i.
-
12. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 10, wherein said header tag is at least one of h1, h2, h3, h4, h5, h6, hr, div, or center.
-
13. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 10, wherein said list tag is at least one of body, table, dl, ul, ol, dir, or menu.
-
14. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 6, wherein said method is implemented locally or remotely on one or more computer-based systems.
-
15. A method of extracting a schematic structure from a document represented by an ordered tree having a plurality of labeled nodes, as per claim 1, wherein said method is implemented across networks comprising any of LANs, WANs, cellular, Internet or Web-based networks.
-
16. An article of manufacture comprising a computer user medium having a computer readable program code embodied therein which implements reformatting at least one mark-up language document by discovering a schematic structure from said document and creating a reformatted document with a schema corresponding to said schematic structure, said steps of discovering and reformatting comprising:
-
receiving a document having a plurality of semantic nodes and a plurality of formatting nodes;
tokenizing said plurality of said semantic and formatting nodes;
identifying a set of keyword nodes, said set of keyword nodes comprising at least one of said plurality of semantic nodes which match an element in a set of keywords and corresponding synonyms;
labeling each element of said set of keyword nodes with a label corresponding to a matching keyword;
arranging said plurality of said semantic and formatting nodes so that each node representing an object of a similar level of abstraction are arranged as sibling nodes;
identifying a non-keyword set of nodes, said non-keyword set comprising all of said plurality of semantic and formatting nodes not an element of said set of keyword nodes;
determining for each node in said non-keyword set a corresponding first child node which is also an element of said set of keyword nodes, said corresponding first child node having a keyword label, and reformatting said document by generating a document with keyword structure based upon re-labeling each node in said non-keyword set with a keyword label of said corresponding first child node.
-
-
17. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes comprising:
-
a first I/O receiver which accepts said at least one document, tokenizes said at least one document to extract structural information, and outputs a document having a plurality of tokens;
a second I/O receiver which accepts a set of keywords and corresponding synonyms, identifies and matches keywords in nodes of said document with said plurality of tokens, and outputs a document having one or more nodes labeled with keywords;
a third I/O receiver which accepts output of said second I/O receiver and a table of mark-up language tags which comprise a plurality of specified groups, groups semantically related nodes in said document output by said second I/O receiver, and outputs a document based upon said grouping;
a heading function which receives and restructures document output from said third I/O receiver, said restructuring allowing objects under nested header tags of a similar label to be reordered as siblings of each other, and a consolidating function which receives output of said heading function and generates a document with keyword structure by replacing a non-keyword node with its first keyword child.
-
-
18. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 17, wherein said at least one document is written in HTML.
-
19. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 18, wherein said plurality of specified groups comprise phrase tags, header tags, and list tags.
-
20. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 17, wherein said first, second and third I/O receivers are the same.
-
21. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 17, wherein said tokenization step uses a set of punctuation rules to separate said plurality of labeled nodes into said plurality of tokens.
-
22. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 17, wherein said system is implemented locally or remotely on one or more computer-based systems.
-
23. A computer-based system for extracting a schematic structure from at least one document represented by an ordered tree having a plurality of labeled nodes, as per claim 17, wherein said system is implemented across networks comprising any of LANs, WANs, cellular, Internet or Web-based networks.
Specification