System and method for efficient filtering of data set addresses in a web crawler
First Claim
1. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
- (a) storing representations of data set addresses in a set of data structures, including a buffer and a first disk file, wherein the representations of data set addresses stored in the first disk file are ordered;
(b) downloading at least one data set that includes addresses of one or more referred data sets;
(c) identifying the addresses of the one or more referred data sets;
(d) for each identified address;
(d1) generating a representation of the identified address;
(d2) determining whether the representation is stored in the buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the buffer; and
(e) when the buffer reaches a predefined full condition;
(e1) ordering the contents of the buffer according to the representations;
(e2) performing an ordered merge of the contents of the buffer into the contents of the first disk file; and
(e3) preventing duplication of any of the representations of data set addresses stored in the first disk file after the ordered merge.
5 Assignments
0 Petitions
Accused Products
Abstract
A web crawler stores fixed length representations of document addresses in a buffer and a disk file, and optionally in a cache. When the web crawler downloads a document from a host computer, it identifies URL'"'"'s (document addresses) in the downloaded document. Each identified URL is converted into a fixed size numerical representation. The numerical representation may optionally be systematically compared to the contents of a cache containing web sites which are likely to be found during the web crawl, for example previously visited web sites. The numerical representation is then systematically compared to numerical representations in the buffer, which stores numerical representations of recently-identified URL'"'"'s. If the representation is not found in the buffer, it is stored in the buffer. When the buffer is full, it is ordered and then merged with numerical representations stored, in order, in the disk file. In addition, the document corresponding to each representation not found in the disk file during the merge is scheduled for downloading. The disk file may be a sparse file, indexed to correspond to the numerical representations of the URL'"'"'s, so that only a relatively small fraction of the disk file must be searched and re-written in order to merge each numerical representation in the buffer.
100 Citations
64 Claims
-
1. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
-
(a) storing representations of data set addresses in a set of data structures, including a buffer and a first disk file, wherein the representations of data set addresses stored in the first disk file are ordered; (b) downloading at least one data set that includes addresses of one or more referred data sets; (c) identifying the addresses of the one or more referred data sets; (d) for each identified address; (d1) generating a representation of the identified address; (d2) determining whether the representation is stored in the buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the buffer; and (e) when the buffer reaches a predefined full condition; (e1) ordering the contents of the buffer according to the representations; (e2) performing an ordered merge of the contents of the buffer into the contents of the first disk file; and (e3) preventing duplication of any of the representations of data set addresses stored in the first disk file after the ordered merge. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
-
(a) storing representations of data set addresses in a set of data structures, including a first buffer, a second buffer, and a first disk file, wherein the first disk file contains ordered representations of data set addresses; (b) selecting as a current buffer one of the first and second buffers; (c) downloading at least one data set that includes addresses of one or more referred data sets; (d) identifying the addresses of the one or more referred data sets; and (e) for each identified address; (e1) generating a representation of the identified address; and (e2) determining whether the representation is stored in the current buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the current buffer; and (f) when the current buffer reaches a predefined full condition; (f1) selecting the other buffer as the current buffer, wherein the previously current buffer is identified as a non-current buffer; (f2) ordering representations stored in the non-current buffer; and (f3) performing an ordered merge of the contents of the non-current buffer into the contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging. - View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21)
-
-
22. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
-
(a) storing representations of data set addresses in a set of data structures, including a buffer and a disk file, wherein representations of data set addresses stored in the disk file are ordered; (b) downloading at least one data set that includes an address of a referred data set; (c) identifying the address of the referred data set; (d) generating a representation of the identified address; (e) determining whether the representation is stored in the buffer, and whether the disk file is empty; (f) when the representation is not stored in the buffer and the disk file is empty, scheduling the corresponding data set for downloading; (g) when the representation is not stored in the buffer and the disk file is not empty, storing the representation in the buffer and delaying scheduling of the corresponding data set for downloading until a condition occurs; and (h) when it is determined that the condition has occurred, performing an ordered merge of contents of the buffer into contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging the contents of the buffer into the contents of the first disk file.
-
-
23. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
a first disk file and a buffer, for storing representations of data set addresses; a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of one or more referred data sets in the downloaded data sets; and an address filtering module for processing a specified one of the identified addresses; the address filtering module including instructions for; generating a representation of the identified address; determining whether the representation is stored in the buffer without determining whether the representation is stored in the first disk file, and when this determination is negative storing the representation in the buffer; and determining whether the buffer has reached a predefined full condition, and when this determination is positive, ordering the contents of the buffer and then performing an ordered merge of contents of the buffer into the contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging the contents of the buffer into the contents of the first disk file. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30)
-
-
31. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
a first disk file, a first buffer, and a second buffer, for storing representations of data set addresses; a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets; and an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for;identifying one of the first and second buffers as a current buffer; generating a representation of the identified address; determining whether the representation is stored in the current buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the current buffer; and determining whether the current buffer has reached a predefined full condition, and when this determination is positive, selecting the other buffer as the current buffer, wherein the previously current buffer is identified as a non-current buffer, ordering the contents of the non-current buffer and then performing an ordered merge of the contents of the non-current buffer into the contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging the contents of the buffer into the contents of the first disk file. - View Dependent Claims (32, 33, 34, 35, 36, 37)
-
-
38. A web crawler for downloading data set addresses from among a plurality of host computers, comprising:
-
a first disk file and a buffer, for storing representations of data set addresses; a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets; and an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for;generating a representation of the identified address; determining whether the representation is stored in the buffer without determining whether the representation is stored in the first disk file, and when this determination is negative storing the representation in the buffer; and determining whether the buffer has reached a predefined full condition, and when this determination is positive, ordering the contents of the buffer and then performing an ordered merge of the contents of the buffer into the contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging the contents of the buffer into the contents of the first disk file. - View Dependent Claims (39, 40, 41, 42, 43, 44, 45)
-
-
46. A web crawler for downloading data set addresses from among a plurality of host computers, comprising:
-
a first disk file, a first buffer and a second buffer, for storing representations of data set addresses; a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets; and an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for;identifying one of the first and second buffers as a current buffer; generating a representation of the identified address; determining whether the representation is stored in the current buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the current buffer; and determining whether the current buffer has reached a predefined full condition, and when this determination is positive, selecting the other buffer as the current buffer, wherein the previously current buffer is identified as a non-current buffer, ordering the contents of the non-current buffer and then performing an ordered merge of the contents of the non-current buffer into the contents of the first disk file wherein the ordered merge comprises preventing duplication of any of the representations of data set addresses stored in the first disk file during or after merging the contents of the buffer into the contents of the first disk file. - View Dependent Claims (47, 48, 49, 50, 51, 52)
-
-
53. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
-
(a) storing representations of data set addresses in a set of data structures, including a buffer and a first disk file, wherein the representations of data set addresses stored in the first disk file are ordered; (b) downloading at least one data set that includes addresses of one or more referred data sets; (c) identifying the addresses of the one or more referred data sets; (d) for each identified address; (d1) generating a representation of the identified address; (d2) determining whether the representation is stored in the buffer without determining whether the representation is stored in the first disk file, and when this determination is negative, storing the representation in the buffer; and (e) when the buffer reaches a predefined full condition; (e1) ordering the contents of the buffer according to the representations; (e2) performing an ordered merge of the contents of the buffer into the contents of the first disk file; and (e3) preventing duplication of any of the representations of data set addresses stored in the first disk file during the ordered merge. - View Dependent Claims (54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64)
-
Specification