System, method, and computer program product for identifying multi-page documents in hypertext collections
Abstract
A system, method, and computer program product for identifying compound documents as a coherent body of hyperlinked material on a single topic as created by an author or collaborating authors, analyzing the content and structure of the compound documents and related hyperlinks, and responsively selecting a preferred entry point at which to begin processing such documents. The body of material may include the internet, an intranet, or other digital library that typically has content distributed over several separate pages or URLs, sometimes in a hierarchical directory structure. The processing may include creating at least one taxonomy, as well as searching or indexing the compound documents. The identification and analysis schemes include an observation of a number of heuristics run on component documents in the compound documents.
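The abstract does not enumerate the heuristics, so the following Python sketch is illustrative only and is not the patented method. It assumes two hypothetical heuristics: pages that share a host and directory are treated as one candidate compound document, and the preferred entry point is the member page with the most in-links from outside the group, with index-like file names breaking ties. The function names, the `links` mapping (page URL to outgoing link URLs), and the example URLs are assumptions made for this example.

```python
from collections import defaultdict
from posixpath import dirname
from urllib.parse import urlparse


def group_by_directory(links):
    """Group page URLs into candidate compound documents by host and directory.

    Assumed heuristic: pages under the same directory on the same host
    belong to the same compound document.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    groups = defaultdict(set)
    for url in pages:
        parts = urlparse(url)
        groups[(parts.netloc, dirname(parts.path))].add(url)
    return groups


def preferred_entry_point(members, links):
    """Pick the member page with the most in-links from outside the group.

    Ties are broken in favor of index-like file names, then alphabetically.
    """
    external_inlinks = {url: 0 for url in members}
    for source, targets in links.items():
        if source in members:
            continue  # links inside the group do not count
        for target in targets:
            if target in members:
                external_inlinks[target] += 1

    def rank(url):
        file_name = urlparse(url).path.rsplit("/", 1)[-1]
        is_index = file_name in ("", "index.html")
        return (-external_inlinks[url], not is_index, url)

    return min(members, key=rank)


# Hypothetical link graph: page URL -> URLs it links to.
links = {
    "http://example.org/guide/index.html": [
        "http://example.org/guide/ch1.html",
        "http://example.org/guide/ch2.html",
    ],
    "http://example.org/guide/ch1.html": ["http://example.org/guide/ch2.html"],
    "http://other.example/list.html": ["http://example.org/guide/index.html"],
}

for key, members in group_by_directory(links).items():
    print(key, preferred_entry_point(members, links))
```

A real system would likely combine several signals, for example anchor text, link structure, and content similarity; directory grouping alone will both split and merge documents incorrectly on many sites.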
21 Claims
1. A method for improving information retrieval, classification, indexing, and summarization, comprising:
identifying a compound document as a coherent body of hyperlinked material on a single topic as created by a number of collaborating authors;
analyzing the content and structure of the compound document to find a preferred entry point for the compound document;
processing the compound document as a whole, including at least one of indexing, classification, and retrieval; and
processing the compound document from the entry point, including at least one of creating at least one of presentation of results from retrieval, summarization, and classification.
20. A system for improving information retrieval, indexing, and summarization comprising:
a compound document identifier that detects a coherent body of hyperlinked material on a single topic as created by a number of collaborating authors;
an analyzer that finds a preferred entry point for the compound document according to the content and structure of the compound document; and
a compound document processor that performs at least one of indexing, classification, and retrieval, for the compound document as a whole, and then performs at least one of creating at least one presentation of results from retrieval, summarization, and classification.
21. A computer program product instantiated on a computer-readable medium, comprising:
a first code means for identifying a compound document as a coherent body of hyperlinked material on a single topic as created by a number of collaborating authors;
a second code means for analyzing the content and structure of the compound document to find a preferred entry point for the compound document; and
a third code means for processing the compound document, wherein the processing includes at least one of creating at least one presentation of results from retrieval, summarization, and classification.
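The claims above distinguish processing a compound document as a whole from presenting results at its entry point. The following Python sketch shows one way that distinction could look in a toy search index; it is an assumption-laden illustration, not the claimed implementation. The `CompoundDocument` class, the `build_index` and `search` functions, and the whitespace tokenization are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CompoundDocument:
    entry_point: str  # URL presented to the user in search results
    pages: dict       # hypothetical: component page URL -> page text


def build_index(documents):
    """Map each term to the entry points of compound documents containing it.

    Terms from every component page are pooled, so the compound document
    is indexed as a whole rather than page by page.
    """
    index = defaultdict(set)
    for doc in documents:
        for text in doc.pages.values():
            for term in text.lower().split():
                index[term].add(doc.entry_point)
    return index


def search(index, query):
    """Return entry points of compound documents matching every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


guide = CompoundDocument(
    entry_point="http://example.org/guide/index.html",
    pages={
        "http://example.org/guide/ch1.html": "Install the package",
        "http://example.org/guide/ch2.html": "Troubleshooting tips",
    },
)
index = build_index([guide])
print(search(index, "install troubleshooting"))
```

In this sketch the query terms occur on different component pages, so a page-level index would match neither page, while the whole-document index returns the single entry point.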
Specification