Method of web crawling utilizing address mapping

US 6,145,003 A
Filed: 12/17/1997
Issued: 11/07/2000
Est. Priority Date: 12/17/1997
Status: Expired due to Term

Abstract

A computer-based system and method of retrieving information pertaining to Web documents on a computer network is disclosed. The method includes maintaining an address map that associates primary addresses with secondary addresses. A primary address includes a network retrieval protocol and a network address. The secondary address may include a different retrieval protocol or a different network address from the primary document address. A Web crawler retrieves a Web document using the primary document address, and determines whether the address map contains a secondary document address prefix corresponding to the primary document address prefix. If a secondary document address prefix exists, the Web crawler creates a secondary address, retrieves additional information pertaining to the Web document, and combines the additional information with the data retrieved from the Web document. The combined data may be stored in an index, and subsequently used to perform a document search.
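
The flow described in the abstract can be illustrated with a minimal sketch. It assumes the address map is a plain dictionary from primary address prefixes to secondary address prefixes, and that a single caller-supplied `fetch` function can retrieve content for either protocol. The names `ADDRESS_MAP`, `secondary_address`, `crawl_one`, and the example hosts are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, Optional

# Hypothetical address map: primary address prefix -> secondary address prefix.
ADDRESS_MAP: Dict[str, str] = {
    "http://www.example.com/docs/": "file://fileserver/share/docs/",
}


def secondary_address(primary: str, address_map: Dict[str, str]) -> Optional[str]:
    """Return the secondary address for `primary`, or None if no prefix matches."""
    # Try longer prefixes first so the most specific mapping wins (an assumption).
    for prefix in sorted(address_map, key=len, reverse=True):
        if primary.startswith(prefix):
            return address_map[prefix] + primary[len(prefix):]
    return None


def crawl_one(primary: str,
              address_map: Dict[str, str],
              fetch: Callable[[str], str]) -> dict:
    """Retrieve a document and, when a mapping exists, its supplementary data."""
    record = {"address": primary, "content": fetch(primary)}
    secondary = secondary_address(primary, address_map)
    if secondary is not None:
        # Combine the additional information with the document's own data.
        record["supplementary"] = fetch(secondary)
    return record
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP or file-system library; the combined record is what would then be stored in an index.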

28 Claims

1. A computer-based method of retrieving Web document information from a computer network, comprising:
- retrieving a Web document from a computer network using a first protocol included in a primary document address specification;
  
  obtaining data from the Web document;
  
  determining whether the primary document address specification has a corresponding secondary document address specification; and
  
  if the primary document address specification has a corresponding secondary document address specification, retrieving supplementary data from the computer network pertaining to the Web document using a second protocol included in the secondary document address specification.
  - 2. The method of claim 1, wherein the primary document address specification includes the first protocol and a first network address, and the secondary document address specification includes the second protocol and a second network address, and wherein the first protocol is different from the second protocol.
  - 3. The method of claim 2, wherein the first protocol is HTTP and the second protocol is FILE.
  - 4. The method of claim 3, further comprising retrieving an access control list corresponding to the Web document by using the secondary document address specification.
  - 5. The method of claim 4, further comprising storing the supplementary data pertaining to the Web document retrieved using the second protocol from the secondary document address specification with the data retrieved from the Web document using the first protocol from the primary document address specification in a document index.
  - 6. The method of claim 1, wherein the primary document address specification includes the first protocol and a first network address, and the secondary document address specification includes the second protocol and a second network address, and the first network address is different from the second network address.
  - 7. The method of claim 6, wherein retrieving supplementary data pertaining to the Web document by using the secondary document address specification comprises retrieving a second Web document that includes supplementary data pertaining to the Web document.
  - 8. The method of claim 7, wherein retrieving the second Web document includes using the hypertext transfer protocol (HTTP) to retrieve the second Web document.
  - 9. The method of claim 7, wherein retrieving the second Web document includes using a database specification to retrieve the second Web document.
  - 10. The method of claim 1, wherein determining whether the primary document address specification has a corresponding secondary document address specification includes determining whether an entry corresponding to the primary document address specification exists in an address map.
  - 11. The method of claim 10, wherein the entry corresponding to the primary document address specification includes a transfer protocol specification and a top level domain specification.
  - 12. The method of claim 1, further comprising:
    - determining whether the primary document address specification has a corresponding tertiary document address specification;
      
      if the primary document address specification has a corresponding tertiary document address specification, retrieving further data pertaining to the Web document by using the tertiary document specification; and
      
      if the primary document address specification has a corresponding tertiary document address specification, storing the further data pertaining to the Web document obtained using the tertiary document address specification with the data obtained from the Web document.
  - 13. The method of claim 1, wherein the secondary document address specification is automatically built by replacing a secondary address prefix for a primary address prefix in the primary document address specification.
  - 14. The method of claim 13, further comprising:
    - (a) obtaining a URL from a transaction log;
      
      (b) parsing the URL into a URL prefix and URL suffix;
      
      (c) providing an address map containing a plurality of primary address prefixes and corresponding secondary address prefixes;
      
      (d) determining if the URL prefix is included in the address map as a primary address prefix;
      
      (i) if the URL prefix is included in the address map as a primary address prefix, combining a secondary address prefix that corresponds to the primary address prefix with the URL suffix to build the secondary document address specification; and
      
      (ii) if the URL prefix is not included in the address map as a primary address prefix, changing the parsing of the URL to incrementally reduce the URL prefix and increase the URL suffix and then repeating this paragraph (d).
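
Steps (b) through (d)(ii) of claim 14 can be sketched as a loop that moves text from the end of the URL prefix to the front of the URL suffix until the prefix is found in the address map. The claim does not say how large each reduction step is; this sketch assumes the prefix shrinks by one `/`-delimited segment at a time. The function name `build_secondary_address` and the example addresses are hypothetical.

```python
from typing import Dict, Optional


def build_secondary_address(url: str, address_map: Dict[str, str]) -> Optional[str]:
    """Build a secondary address by prefix substitution, or return None."""
    # (b) Initial parse: the whole URL is the prefix, the suffix is empty.
    prefix, suffix = url, ""
    while prefix:
        # (d) Is the current prefix a primary address prefix in the map?
        if prefix in address_map:
            # (d)(i) Combine the secondary prefix with the URL suffix.
            return address_map[prefix] + suffix
        # (d)(ii) Reduce the prefix and grow the suffix by one segment.
        # Search for the last "/" that is not the final character.
        cut = prefix.rfind("/", 0, len(prefix) - 1)
        if cut < 0:
            return None  # nothing left to strip; no mapping exists
        prefix, suffix = prefix[:cut + 1], prefix[cut + 1:] + suffix
    return None


# (a) URLs would come from a transaction log; a list stands in for it here.
transaction_log = ["http://www.example.com/docs/a/b.html"]
# (c) Hypothetical address map of primary to secondary prefixes.
address_map = {"http://www.example.com/docs/": "file://fileserver/share/"}
results = [build_secondary_address(u, address_map) for u in transaction_log]
```

Because the prefix only ever gets shorter, the loop terminates for any input, and the first match found is the longest mapped prefix of the URL.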

15. A computer-based method of retrieving information from a computer network during a network crawl, comprising:
- retrieving an electronic document from the computer network, the electronic document including at least one hyperlink specification including a primary document address specification;
  
  retrieving at least one primary document address specification from the electronic document using a first protocol included in the primary address specification, each primary document address corresponding to a linked electronic document;
  
  determining whether the primary document address specification has a corresponding secondary document address specification;
  
  if the primary document address specification has a corresponding secondary document address specification, retrieving supplementary data pertaining to the linked electronic document from the computer network using a second protocol included in the secondary document address specification; and
  
  if the primary document address specification has a corresponding secondary document address specification, storing the supplementary data pertaining to the linked electronic document obtained using the secondary document address specification and associating the stored supplementary data pertaining to the linked electronic document with the primary document address specification.
  - 16. The method of claim 15, wherein the primary document address specification includes the first protocol and a first network address, and the secondary document address specification includes the second protocol and a second network address, and wherein the first protocol is different from the second protocol.
  - 17. The method of claim 15, wherein the primary document address specification includes the first protocol and a first network address, and the secondary document address specification includes the second protocol and a second network address, and the first network address is different from the second network address.
  - 18. The method of claim 15, wherein determining whether the primary document address specification has a corresponding secondary document address specification includes determining whether an entry corresponding to the primary document address specification exists in an address map.
  - 19. The method of claim 15, further comprising:
    - retrieving data from the linked electronic document using the primary document address specification;
      
      storing the data retrieved from the linked electronic document using the primary document address specification; and
      
      associating the data retrieved from the linked electronic document using the primary document address specification with the supplementary data pertaining to the linked electronic document retrieved using the secondary document address specification.
  - 20. The method of claim 15, further comprising:
    - automatically retrieving a plurality of primary document address specifications from a plurality of hyperlinks included in the electronic document;
      
      automatically retrieving a plurality of secondary document address specifications corresponding to said plurality of primary document address specifications; and
      
      automatically retrieving supplementary data pertaining to the linked electronic document using said plurality of secondary document address specifications.
  - 21. The method of claim 15, wherein the secondary address specification is automatically built by replacing a secondary address prefix for a primary address prefix in the primary document address specification.
  - 22. The method of claim 21, further comprising:
    - (a) obtaining a URL from a transaction log;
      
      (b) parsing the URL into a URL prefix and URL suffix;
      
      (c) providing an address map containing a plurality of primary address prefixes and corresponding secondary address prefixes;
      
      (d) determining if the URL prefix is included in the address map as a primary address prefix;
      
      (i) if the URL prefix is included in the address map as a primary address prefix, combining a secondary address prefix that corresponds to the primary address prefix with the URL suffix to build the secondary address specification; and
      
      (ii) if the URL prefix is not included in the address map as a primary address prefix, changing the parsing of the URL to incrementally reduce the URL prefix and increase the URL suffix and then repeating this paragraph (d).
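
The hyperlink-driven variant in claims 15 and 20 can be sketched as follows: collect the primary addresses from the hyperlinks in a retrieved document, then pair each one that has a mapping with its secondary address. This sketch assumes the electronic document is HTML and uses the Python standard library's `html.parser` and `urllib.parse`. The names `LinkCollector` and `plan_supplementary_fetches`, and the simple prefix lookup, are hypothetical simplifications.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute addresses from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the containing document.
                    self.links.append(urljoin(self.base_url, value))


def plan_supplementary_fetches(html: str, base_url: str,
                               address_map: Dict[str, str]) -> Dict[str, str]:
    """Map each linked primary address that has a mapping to its secondary address."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    plan: Dict[str, str] = {}
    for primary in collector.links:
        for prefix, secondary_prefix in address_map.items():
            if primary.startswith(prefix):
                plan[primary] = secondary_prefix + primary[len(prefix):]
                break
    return plan
```

Keying the result by primary address keeps any supplementary data fetched later associated with the primary document address specification, as the final step of claim 15 requires.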

23. A system for performing a Web crawl, the system comprising:
- a server computer having a Web crawler program executing thereon;
  
  an address map accessible to the Web crawler program and containing a plurality of primary Web addresses and a plurality of secondary Web addresses, each primary Web address having a corresponding secondary Web address;
  
  the primary Web address including a first protocol for the retrieval of a Web document at the primary Web address;
  
  the secondary Web address including a second protocol for the retrieval of a Web document at the secondary Web address;
  
  the second protocol for the retrieval of a Web document at the secondary Web address being different than the first protocol for the retrieval of a Web document at the primary Web address;
  
  a computer network including at least one Web server having a plurality of Web documents stored thereon, each Web document having a corresponding primary Web address;
  
  a database containing information pertaining to the plurality of Web documents;
  
  program code for;
  
  retrieving a primary Web address corresponding to one of the Web documents;
  
  determining whether the primary Web address has a corresponding secondary Web address;
  
  selectively retrieving supplementary information pertaining to said one of the Web documents using the corresponding secondary Web address; and
  
  if the supplementary information pertaining to said one of the Web documents is retrieved using the secondary Web address, storing said supplementary information in the database.
  - 24. The system of claim 23, further comprising a search engine that performs a Web search using the database.
  - 25. The system of claim 24, wherein the first protocol is HTTP, the second protocol is FILE, and the retrieving of supplementary information pertaining to said one of the Web documents using the secondary Web address includes using file system commands to retrieve the supplementary information pertaining to said one of the Web documents.
  - 26. The system of claim 25, wherein the supplementary information pertaining to said one of the Web documents includes an access control list.
  - 27. The system of claim 23, wherein each primary Web address includes a data transfer protocol specification and a top level domain specification, and each secondary Web address includes a data transfer protocol specification and a top level domain specification.
  - 28. The system of claim 23, wherein the address map and said plurality of Web documents reside on different computers.
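
The database and search engine of claims 23 through 26 can be illustrated with a small in-memory sketch. It assumes the supplementary information is an access control list represented as a set of user names, and that the search engine uses it to filter results. That filtering behavior, the `DocumentIndex` class, and its method names are assumptions made for illustration and are not stated in the claims.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class IndexEntry:
    address: str                    # primary Web address
    text: str                       # data obtained from the Web document
    acl: frozenset = frozenset()    # supplementary data; empty means none stored


class DocumentIndex:
    """In-memory stand-in for the database of claim 23."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}

    def store(self, address: str, text: str,
              acl: Optional[Iterable[str]] = None) -> None:
        # Supplementary information is stored only if it was retrieved.
        self._entries[address] = IndexEntry(address, text, frozenset(acl or ()))

    def search(self, term: str, user: str) -> List[str]:
        """Return addresses whose text contains `term` and that `user` may see."""
        hits = []
        for entry in self._entries.values():
            if term.lower() not in entry.text.lower().split():
                continue
            # Assumed policy: an entry without an ACL is visible to everyone.
            if entry.acl and user not in entry.acl:
                continue
            hits.append(entry.address)
        return sorted(hits)
```

For example, a document stored with `acl={"alice"}` would appear in results for `alice` but not for another user, while a document stored without an ACL would appear for both.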

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Meyerzon, Dmitriy; Sanu, Sankrant
Primary Examiner(s): Geckil, Mehmet B.
Application Number: US08/992,329
Time in Patent Office: 1,056 Days
Field of Search: 709/224, 709/218, 709/203, 709/250, 709/217, 709/225, 709/229, 709/236, 707/10, 707/3, 707/4, 707/103, 707/2, 707/104, 707/9, 707/5, 707/500, 707/501, 707/513, 707/522, 395/327
US Class Current: 709/225
CPC Class Codes:
H04L 61/00   Network arrangements, proto...
H04L 61/45   Network directories; Name-t...
H04L 67/02   based on web technology, e....
H04L 67/10   in which an application is ...
H04L 69/329   in the application layer [O...
Y10S 707/99933   Query processing, i.e. sear...
