Method of web crawling utilizing address mapping
Abstract
A computer-based system and method of retrieving information pertaining to Web documents on a computer network is disclosed. The method includes maintaining an address map that associates primary addresses with secondary addresses. A primary address includes a network retrieval protocol and a network address. The secondary address may include a different retrieval protocol or a different network address from the primary document address. A Web crawler retrieves a Web document using the primary document address, and determines whether the address map contains a secondary document address prefix corresponding to the primary document address prefix. If a secondary document address prefix exists, the Web crawler creates a secondary address, retrieves additional information pertaining to the Web document, and combines the additional information with the data retrieved from the Web document. The combined data may be stored in an index, and subsequently used to perform a document search.
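The prefix lookup described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the map contents, the `to_secondary` name, and the longest-prefix tie-break are all assumptions introduced here.

```python
from typing import Dict, Optional

# Hypothetical address map: primary address prefix -> secondary address prefix.
# The hosts and paths below are invented for illustration only.
ADDRESS_MAP: Dict[str, str] = {
    "http://www.example.com/docs/": "file://fileserver/share/docs/",
    "http://www.example.com/": "file://fileserver/share/root/",
}


def to_secondary(primary: str, address_map: Dict[str, str]) -> Optional[str]:
    """Return the secondary address for `primary`, or None if no prefix matches."""
    # Collect every primary prefix that the address starts with.
    matches = [prefix for prefix in address_map if primary.startswith(prefix)]
    if not matches:
        return None
    # Assumption: the longest matching prefix wins, so specific entries
    # override general ones. The source does not state a tie-break rule.
    best = max(matches, key=len)
    # Swap the primary prefix for the secondary prefix, keeping the remainder.
    return address_map[best] + primary[len(best):]
```

With this map, `to_secondary("http://www.example.com/docs/a.html", ADDRESS_MAP)` yields a `file://` address under the `docs/` share, while an address on an unmapped host yields `None`, in which case the crawler would index only the data retrieved from the Web document itself.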
28 Claims
1. A computer-based method of retrieving Web document information from a computer network, comprising:
- retrieving a Web document from a computer network using a first protocol included in a primary document address specification;
- obtaining data from the Web document;
- determining whether the primary document address specification has a corresponding secondary document address specification; and
- if the primary document address specification has a corresponding secondary document address specification, retrieving supplementary data from the computer network pertaining to the Web document using a second protocol included in the secondary document address specification.
Dependent claims: 2-14.
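The four steps of claim 1 might be sketched as below, with the protocol taken from the scheme of each address. This is an illustrative reading only: `crawl_one`, the `fetchers` table, and the in-memory stand-ins for the network are hypothetical names introduced here, and a real crawler would substitute actual protocol clients.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

Fetcher = Callable[[str], str]


def crawl_one(primary: str,
              secondary_for: Callable[[str], Optional[str]],
              fetchers: Dict[str, Fetcher]) -> Dict[str, Optional[str]]:
    """Fetch one document and, if a secondary address exists, its supplementary data."""
    # Steps 1 and 2: retrieve the document with the protocol named in the
    # primary address and keep the data obtained from it.
    data = fetchers[urlsplit(primary).scheme](primary)
    record: Dict[str, Optional[str]] = {
        "address": primary, "data": data, "supplementary": None,
    }
    # Step 3: determine whether a corresponding secondary address exists.
    secondary = secondary_for(primary)
    if secondary is not None:
        # Step 4: retrieve supplementary data with the second protocol.
        record["supplementary"] = fetchers[urlsplit(secondary).scheme](secondary)
    return record


# In-memory stand-ins for the network (assumed data, for illustration only).
fake_http = {"http://www.example.com/a.html": "<html>A</html>"}
fake_file = {"file://fileserver/share/a.html": "owner=alice"}
fetchers = {"http": fake_http.__getitem__, "file": fake_file.__getitem__}
mapping = {"http://www.example.com/a.html": "file://fileserver/share/a.html"}

record = crawl_one("http://www.example.com/a.html", mapping.get, fetchers)
```

Keeping the fetchers in a table keyed by scheme means the same function handles any pair of protocols; the claim itself does not prescribe this structure.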
15. A computer-based method of retrieving information from a computer network during a network crawl, comprising:
- retrieving an electronic document from the computer network, the electronic document including at least one hyperlink specification including a primary document address specification;
- retrieving at least one primary document address specification from the electronic document using a first protocol included in the primary address specification, each primary document address corresponding to a linked electronic document;
- determining whether the primary document address specification has a corresponding secondary document address specification;
- if the primary document address specification has a corresponding secondary document address specification, retrieving supplementary data pertaining to the linked electronic document from the computer network using a second protocol included in the secondary document address specification; and
- if the primary document address specification has a corresponding secondary document address specification, storing the supplementary data pertaining to the linked electronic document obtained using the secondary document address specification and associating the stored supplementary data pertaining to the linked electronic document with the primary document address specification.
Dependent claims: 16-22.
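As a rough sketch of the link-driven variant in claim 15, the code below pulls hyperlink addresses out of a retrieved document and stores any supplementary data under the primary address it belongs to. The names `LinkCollector` and `collect_supplementary` are hypothetical, the page and lookup tables are invented, and only absolute `href` values are handled.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_supplementary(html: str,
                          secondary_for: Callable[[str], Optional[str]],
                          fetch_secondary: Callable[[str], str]) -> Dict[str, str]:
    """Map each linked primary address to its supplementary data, where available."""
    collector = LinkCollector()
    collector.feed(html)
    store: Dict[str, str] = {}
    for primary in collector.links:
        secondary = secondary_for(primary)
        if secondary is not None:
            # Associate the stored data with the primary address, not the secondary.
            store[primary] = fetch_secondary(secondary)
    return store


# Assumed sample data, for illustration only.
page = ('<a href="http://www.example.com/a.html">A</a> '
        '<a href="http://www.example.com/b.html">B</a>')
mapping = {"http://www.example.com/a.html": "file://fileserver/share/a.html"}
supplementary = {"file://fileserver/share/a.html": "owner=alice"}

store = collect_supplementary(page, mapping.get, supplementary.__getitem__)
```

Keying the store by primary address mirrors the claim's requirement that the supplementary data be associated with the primary document address specification; links without a secondary address are simply skipped.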
23. A system for performing a Web crawl, the system comprising:
- a server computer having a Web crawler program executing thereon;
- an address map accessible to the Web crawler program and containing a plurality of primary Web addresses and a plurality of secondary Web addresses, each primary Web address having a corresponding secondary Web address;
- the primary Web address including a first protocol for the retrieval of a Web document at the primary Web address;
- the secondary Web address including a second protocol for the retrieval of a Web document at the secondary Web address;
- the second protocol for the retrieval of a Web document at the secondary Web address being different than the first protocol for the retrieval of a Web document at the primary Web address;
- a computer network including at least one Web server having a plurality of Web documents stored thereon, each Web document having a corresponding primary Web address;
- a database containing information pertaining to the plurality of Web documents;
- program code for:
  - retrieving a primary Web address corresponding to one of the Web documents;
  - determining whether the primary Web address has a corresponding secondary Web address;
  - selectively retrieving supplementary information pertaining to said one of the Web documents using the corresponding secondary Web address; and
  - if the supplementary information pertaining to said one of the Web documents is retrieved using the secondary Web address, storing said supplementary information in the database.
Dependent claims: 24-28.
Specification