System and method for efficient representation of data set addresses in a web crawler
First Claim
1. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
- (a) storing representations of data set addresses in a set of data structures, including a first cache, a second cache, and a disk file;
(b) downloading at least one data set that includes addresses of one or more referred data sets;
(c) identifying the addresses of the one or more referred data sets, and (d) for each identified address;
(d1) generating a fixed-length representation of the identified address;
(d2) determining first whether the representation of the identified address is stored in the first cache, and when the first determination is negative determining second whether the representation of the identified address is stored in the second cache, and when the second determination is negative determining third whether the representation of the identified address is stored in the disk file;
(d3) when the third determination is negative, storing the representation of the identified address in the second cache and scheduling the corresponding data set for downloading; and
(d4) when the third determination is positive, storing the representation of the identified address in the first cache.
7 Assignments
0 Petitions
Accused Products
Abstract
A web crawler stores fixed length representations of document addresses in first and second caches and a disk file. When the web crawler downloads a document from a host computer, it identifies URL'"'"'s (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation is systematically compared to numerical representations in the caches and disk file. If the representation is not found in the caches and disk file, the document corresponding to the representation is scheduled for downloading, and the representation is stored in the second cache. If the representation is not found in the caches but is found in the disk file, the representation is added to the first cache. When the second cache is full, it is merged with the disk file and the second cache is reset to an initial state. When the first cache is full, one or more representations are evicted in accordance with an eviction policy. The representations include a prefix that is a function of a host component of the corresponding URL'"'"'s, and the representations are stored in the disk file in sorted order. When the web crawler searches for a representation in the disk file, an index of the disk file is searched to identify a single block of the disk file, and only that single block of the disk file is searched for the representation.
-
Citations
22 Claims
-
1. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
-
(a) storing representations of data set addresses in a set of data structures, including a first cache, a second cache, and a disk file;
(b) downloading at least one data set that includes addresses of one or more referred data sets;
(c) identifying the addresses of the one or more referred data sets, and (d) for each identified address;
(d1) generating a fixed-length representation of the identified address;
(d2) determining first whether the representation of the identified address is stored in the first cache, and when the first determination is negative determining second whether the representation of the identified address is stored in the second cache, and when the second determination is negative determining third whether the representation of the identified address is stored in the disk file;
(d3) when the third determination is negative, storing the representation of the identified address in the second cache and scheduling the corresponding data set for downloading; and
(d4) when the third determination is positive, storing the representation of the identified address in the first cache. - View Dependent Claims (2, 3, 4, 5, 6, 7, 22)
when the first cache reaches a predefined full condition, one or more data set address representations in the first cache are evicted in accordance with a predefined eviction policy. -
3. The method of claim 1, wherein
when the second cache reaches a predefined full condition, the data set address representations in the second cache are merged into the data set address representations in the disk file, and the second cache is reset to a predefined initial state. -
4. The method of claim 1, wherein
the disk file in which data set address presentations are stored comprises a sequence of disk blocks; -
the data set address representations in the disk file are stored in a predefined sorted order;
step (a) includes generating a disk file index, distinct from said set of data structures, that stores information corresponding to a first data set address representation in each of the disk blocks of the disk file; and
the step of determining whether the representation of the identified address is stored in the disk file includes searching the disk file index to identify a single disk block of the disk file to search.
-
-
5. The method of claim 1, wherein
step (d1) includes generating a first fingerprint of only a host address portion of the identified address, and concatenating the first and second fingerprints to form the fixed-length representation of the identified address; the data set address representations in the disk file each comprise a concatenation of a first fingerprint of only a host address portion of the data set address associated with the data set address representation and a second fingerprint of the data set address, and the data set representations are stored in the disk file in an order corresponding to numeric values of the data address representations.
-
6. The method of claim 5, wherein the data sets include web pages and the data set addresses include uniform resource locators.
-
7. The method of claim 1, wherein said step (d1) includes
(i) obtaining a first representation portion based solely on a host component of said identified address; -
(ii) obtaining a second representation portion based on said identified address; and
(iii) combining said first and second representation portions.
-
-
22. The method of claim 7, wherein
the data set address representations in the disk file each comprise a concatenation of a first representation portion of only a host component of the data set address associated with the data set address representation and a second representation portion based on the data set address, and the data set representations are stored in the disk file in an order corresponding to numeric values of the data address representations.
-
-
8. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
a disk file, a first cache and a second cache, for storing representations of data set addresses;
a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets, and an address processing module for processing a specified one of the identified addresses;
the address processing module including instructions for;
generating a fixed-length representation of the identified address;
determining first whether the representation of the identified address is stored in the first cache, and when the first determination is negative determining second whether the representation of the identified address is stored in the second cache, and when the second determination is negative determining third whether the representation of the identified address is stored in the disk file;
when the third determination is negative, storing the representation of the identified address in the second cache and scheduling the corresponding data set for downloading; and
when the third determination is positive, storing the representation of the identified address in the first cache. - View Dependent Claims (9, 10, 11, 12, 13, 14)
the disk file in which data set address representations are stored comprises a sequence of disk blocks; the data set address representations in the disk file are stored in a predefined sorted order;
the address processing module includes instructions for generating a disk file index, distinct from said set of data structures, that stores information corresponding to a first data set address representation in each of the disk blocks of the disk file; and
the address processing module includes instructions for searching the disk file index to identify a single disk block of the disk file to search for the identified address.
-
-
12. The computer program product of claim 8, wherein the address processing module includes instructions for generating a first fingerprint of only a host address portion of the identified address, generating a second fingerprint of the identified address, and concatenating the first and second fingerprints to form the fixed-length representation of the identified address;
- and
the data set address representations in the disk file each comprise a concatenation of a first fingerprint of only a host address portion of the data set address associated with the data set address representation and a second fingerprint of the data set address, and the data set representations are stored in the disk file in an order corresponding to numeric values of the data address representations.
- and
-
13. The computer program product of claim 12, wherein the data sets include web pages and the data set addresses include uniform resource locators.
-
14. The computer program product of claim 8, wherein the address processing module includes instructions for
(i) obtaining a first representation portion based solely on a host component of said identified address; -
(ii) obtaining a second representation portion based on said identified address; and
(iii) combining said first and second representation portions.
-
-
15. A web crawler for downloading data set addresses from among a plurality of host computers, comprising:
-
a disk file, a first cache and a second cache, for storing representations of data set addresses;
a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets, and an address processing module for processing a specified one of the identified addresses;
the address processing module including instructions for;
generating a fixed-length representation of the identified address;
determining first whether the representation of the identified address is stored in the first cache, and when the first determination is negative determining second whether the representation of the identified address is stored in the second cache, and when the second determination is negative determining third whether the representation of the identified address is stored in the disk file;
when the third determination is negative, storing the representation of the identified address in the second cache and scheduling the corresponding data set for downloading; and
when the third determination is positive, storing the representation of the identified address in the first cache. - View Dependent Claims (16, 17, 18, 19, 20, 21)
the disk file in which data set address representations are stored comprises a sequence of disk blocks; the data set address representations in the disk file are stored in a predefined sorted order;
the address processing module includes instructions for generating a disk file index, distinct from said set of data structures, that stores information corresponding to a first data set address representation in each of the disk blocks of the disk file; and
the address processing module includes instructions for searching the disk file index to identify a single disk block of the disk file to search for the identified address.
-
-
19. The web crawler of claim 15, wherein the address processing module includes instructions for generating a first fingerprint of only a host address portion of the identified address, generating a second fingerprint of the identified address, and concatenating the first and second fingerprints to form the fixed-length representation of the identified address;
- and
the data set address representations in the disk file each comprise a concatenation of a first fingerprint of only a host address portion of the data set address associated with the data set address representation and a second fingerprint of the data set address, and the data set representations are stored in the disk file in an order corresponding to numeric values of the data address representations.
- and
-
20. The web crawler of claim 19, wherein the data sets include web pages and the data set addresses include uniform resource locators.
-
21. The web crawler of claim 15, wherein the address processing module includes instructions for
(i) obtaining a first representation portion based solely on a host component of said identified address; -
(ii) obtaining a second representation portion based on said identified address; and
(iii) combining said first and second representation portions.
-
Specification