Determination of member pages for a hyperlinked document with link and document analysis
Abstract
The present invention relates to a methodology that employs two cooperative processes to assemble a document from content spanning multiple web pages. Given a starting location, the first process analyzes a single page at a time to find candidate links. The links are followed recursively, and the pages they lead to are analyzed in turn. A detailed set of heuristics determines what is or is not a candidate link. The candidate pages are then fed to a document-level analyzer, which compares the attributes of each page against the others and looks for a document-like structure. Using another detailed set of heuristics, the document-level analyzer determines whether each page should be included in the document.
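The two cooperating processes described above can be pictured with a small Python sketch. It is illustrative only and is not the patented method: the abstract does not spell out the heuristics, so the "same host and same directory" test used here is a stand-in assumption, and the names `SITE`, `page_level_candidates`, `collect_candidates`, and `document_level_groups` are hypothetical. The web is modeled as an in-memory dictionary so the example runs without network access.

```python
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical in-memory "web": page URL -> list of (anchor text, href) pairs.
SITE = {
    "http://example.com/guide/index.html": [
        ("Chapter 1", "ch1.html"),
        ("Chapter 2", "ch2.html"),
        ("Home", "http://example.com/"),
    ],
    "http://example.com/guide/ch1.html": [("Next", "ch2.html")],
    "http://example.com/guide/ch2.html": [
        ("Previous", "ch1.html"),
        ("Sponsor", "http://ads.example.net/buy.html"),
    ],
    "http://example.com/": [("Guide", "guide/index.html")],
    "http://ads.example.net/buy.html": [],
}


def _dir_key(url):
    """Return (host, directory) for a URL; used by the placeholder heuristics."""
    parts = urlparse(url)
    return (parts.netloc, posixpath.dirname(parts.path))


def page_level_candidates(url, site):
    """Page-level analysis of ONE page.

    Assumed heuristic (not from the source): a link is a candidate
    if it stays on the same host and in the same directory.
    """
    found = []
    for _text, href in site.get(url, []):
        target = urljoin(url, href)
        if target != url and _dir_key(target) == _dir_key(url):
            found.append(target)
    return found


def collect_candidates(url, site, seen=None):
    """Recursively apply the page-level analysis until no new pages appear."""
    seen = set() if seen is None else seen
    if url in seen:
        return seen
    seen.add(url)
    for target in page_level_candidates(url, site):
        collect_candidates(target, site, seen)
    return seen


def document_level_groups(pages):
    """Document-level analysis over the whole candidate set.

    Assumed heuristic: pages sharing a host and directory form one document.
    """
    groups = {}
    for page in sorted(pages):
        groups.setdefault(_dir_key(page), []).append(page)
    return list(groups.values())
```

Calling `collect_candidates("http://example.com/guide/index.html", SITE)` gathers the index page and both chapters while skipping the home page and the sponsor link; `document_level_groups` then places those three pages in a single group. A real implementation would replace both placeholder tests with the richer heuristics the abstract refers to.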
48 Claims
1. An automated identification methodology for assembling document related hyperlinked pages comprising:
performing a page-level link analysis that identifies those hyperlinks on a page linking to a candidate document page potentially part of the document;
performing a recursive application of the page-level link analysis to the linked candidate document page and any further nested candidate document pages thereby identified, until a collective set of identified candidate document pages is assembled; and
performing a document-level analysis that examines the collective set of identified candidate document pages for grouping into one or more documents.
Dependent claims: 2-25.
26. A system identification methodology for assembling a hyperlinked document comprising:
performing a page-level link analysis that identifies those hyperlinks on a page linking to a candidate document page, further comprising a methodology of:
identifying possible progression links; and
identifying possible table of content links;
performing a recursive application of the page-level link analysis to the linked candidate document page and any further nested candidate document pages thereby identified, until a collective set of identified candidate document pages is assembled; and
performing a document-level analysis that examines the collective set of identified candidate document pages for grouping into one or more documents.
Dependent claims: 27-36.
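As an illustration of the two identification steps recited in claim 26, the following Python sketch separates a page's links into possible progression links and possible table of content links. The claim does not state how either kind of link is recognized, so everything here is an assumption made for the example: the word list `PROGRESSION_WORDS`, the pattern `TOC_PATTERN`, the `min_entries` threshold, and both function names are hypothetical.

```python
import re

# Assumed cue words for progression links (not taken from the claims).
PROGRESSION_WORDS = {"next", "previous", "prev", "continue"}

# Assumed cue for table of content entries: a structural word such as
# "Chapter", or a leading section number such as "1." or "2.3".
TOC_PATTERN = re.compile(
    r"^(chapter|section|part|appendix)\b|^\d+(\.\d+)*[.)]?\s",
    re.IGNORECASE,
)


def possible_progression_links(links):
    """Return hrefs whose anchor text is exactly a progression cue word."""
    return [href for text, href in links
            if text.strip().lower() in PROGRESSION_WORDS]


def possible_toc_links(links, min_entries=2):
    """Return hrefs that look like table of content entries.

    A lone match is ignored: a table of contents is assumed to
    contain at least `min_entries` entries.
    """
    hits = [href for text, href in links if TOC_PATTERN.search(text.strip())]
    return hits if len(hits) >= min_entries else []


links = [
    ("Next", "ch2.html"),
    ("Chapter 1", "ch1.html"),
    ("Chapter 2", "ch2.html"),
    ("Contact us", "contact.html"),
]
progression = possible_progression_links(links)
toc = possible_toc_links(links)
```

With the sample `links` above, `progression` holds only the "Next" target and `toc` holds the two chapter targets; the "Contact us" link matches neither test. Anchor-text matching is only one possible signal, and it is chosen here because it keeps the sketch short.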
37. A system identification methodology for assembling a hyperlinked document comprising:
performing a page-level link analysis that identifies those hyperlinks on a page linking to a candidate document page, further comprising a methodology of:
identifying possible progression links;
identifying possible table of content links; and
examining the possible progression links and the possible table of content links for common characteristics;
performing a recursive application of the page-level link analysis to the linked candidate document page and any further nested candidate document pages thereby identified, until a collective set of identified candidate document pages is assembled; and
performing a document-level analysis that examines the collective set of identified candidate document pages for grouping into one or more documents.
Dependent claims: 38-48.
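Claim 37 adds a step that examines the progression and table of content links for common characteristics. The claim does not say which characteristics are compared, so the sketch below picks one as an assumption: the host and directory that most of the link targets share. The function names `common_location` and `filter_by_common_location` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath


def _location(url):
    """Return the (host, directory) pair of an absolute URL."""
    parts = urlparse(url)
    return (parts.netloc, posixpath.dirname(parts.path))


def common_location(urls):
    """Return the (host, directory) shared by the most URLs, or None if empty."""
    if not urls:
        return None
    return Counter(_location(u) for u in urls).most_common(1)[0][0]


def filter_by_common_location(progression, toc):
    """Keep only links that share the dominant location.

    Assumed characteristic (not from the claims): pages of one
    document tend to live on the same host, in the same directory.
    """
    combined = list(dict.fromkeys(progression + toc))  # de-duplicate, keep order
    key = common_location(combined)
    return [u for u in combined if _location(u) == key]


progression = ["http://example.com/guide/ch2.html"]
toc = [
    "http://example.com/guide/ch1.html",
    "http://example.com/guide/ch2.html",
    "http://example.com/about/team.html",
]
kept = filter_by_common_location(progression, toc)
```

Here `kept` retains the two `/guide/` pages and drops the `/about/` page, because the guide directory is the location most of the candidate links have in common. Other characteristics, such as file-naming patterns or anchor-text style, could be compared in the same way.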
Specification