Method for downloading high volumes of content from the Internet without adversely affecting the source of the content or being detected
Abstract
A method and system for downloading high volumes of content from the Internet without adversely affecting the source of the content or being detected. A director server processes data received from a data source to generate a list of addresses for content to download. The director server assigns a portion of the list of addresses to one or more puller servers. Puller servers initiate requests to download content that corresponds to their assigned list of addresses. A proxy gateway server receives the download request for content, and downloads the content via various Internet Protocol addresses.
26 Citations
22 Claims
1. A system for downloading a plurality of documents from a plurality of content servers, said content servers being linked to a plurality of routers that each have a different network address, said system comprising:
a plurality of pullers;
a director for:
creating a list of URLs of the plurality of documents to be downloaded from the plurality of content servers, each of the plurality of said documents being identified by a different URL; and
assigning a portion of the list of URLs to each of the pullers such that each portion assigned to a particular puller includes all documents to be retrieved from a single content server wherein no two pullers initiate requests to adjacent URLs, wherein adjacent URLs identify documents located on the same content server;
wherein each of the plurality of pullers is responsive to the director for:
receiving the assigned portion of the list of URLs;
queuing requests to retrieve documents identified by the received portion of the list of URLs wherein the requests having different URLs are queued by the puller;
determining if the URL of a first queued request is adjacent to the URL of a document being currently downloaded;
if the URL of the first queued request is adjacent to the URL of a document being currently downloaded, waiting until the currently downloading document has been received before initiating the first queued request to avoid overlapping requests to the content server;
if the URL of the queued request is not adjacent to the URL of a document being currently downloaded, initiating the first queued request; and
a proxy gateway responsive to each of the pullers for receiving the initiated requests to retrieve documents, and for retrieving documents corresponding to the list of URLs from the content servers via the routers.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
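The director's assignment step in claim 1 can be illustrated with a minimal sketch. It assumes that the hostname of a URL stands in for "content server", and it balances whole host groups across pullers with a simple greedy rule; the function name `assign_by_host` and the balancing rule are illustrative assumptions, not taken from the claim.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def assign_by_host(urls, puller_count):
    """Split urls into one portion per puller, keeping each host in one portion.

    Assumption: two URLs are "adjacent" when they share a hostname.
    """
    if puller_count < 1:
        raise ValueError("at least one puller is required")

    # Group the URL list by host so a host is never split across pullers.
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).hostname].append(url)

    portions = [[] for _ in range(puller_count)]
    # Greedy balancing (an assumption): largest host group first,
    # each group goes to the puller with the fewest URLs so far.
    for host in sorted(groups, key=lambda h: (-len(groups[h]), h)):
        target = min(range(puller_count), key=lambda i: len(portions[i]))
        portions[target].extend(groups[host])
    return portions
```

Because every URL for a given host lands in exactly one portion, no two pullers can issue requests to adjacent URLs, which is the property the claim requires of the assignment.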
12. A method for retrieving a plurality of documents from a content server, comprising:
creating a list of URLs corresponding to the plurality of documents to be downloaded, each of said documents being identified by a different URL;
queuing requests to retrieve documents corresponding to the list of URLs from the content server wherein requests to the content server having different URLs are queued;
determining if the URL of a first queued request is adjacent to the URL of a document being currently downloaded, wherein adjacent URLs identify documents located on the same content server;
if the URL of the first queued request is adjacent to the URL of a document being currently downloaded, waiting until the current document has been downloaded by the content server before initiating one of the queued requests to avoid overlapping requests to the content server;
if the URL of the queued request is not adjacent to the URL of a document being currently downloaded, then initiating one of the queued requests;
assigning a portion of documents corresponding to the initiated requests to be retrieved via each of a plurality of network addresses; and
retrieving the requested documents from the content server via the plurality of network addresses, wherein each of the network addresses appears to indicate a location from which the document requests were initiated.
Dependent claims: 13, 14, 15, 16
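The queue-and-wait logic recited in claim 12 can be sketched as a small in-memory gate. This is an illustration under stated assumptions, not the patented implementation: the class name `AdjacencyGate`, its methods, and the use of the hostname as the test for "same content server" are all hypothetical, and no network traffic is performed.

```python
from collections import deque
from urllib.parse import urlsplit


def host_of(url):
    """Assumption: the hostname identifies the content server."""
    return urlsplit(url).hostname


class AdjacencyGate:
    """Holds queued URLs and releases the first one only when it is not adjacent."""

    def __init__(self, urls):
        self.queue = deque(urls)   # requests waiting to be initiated
        self.in_flight = set()     # URLs currently being downloaded

    def try_start(self):
        """Return the first queued URL if it may start now, else None."""
        if not self.queue:
            return None
        first = self.queue[0]
        busy_hosts = {host_of(u) for u in self.in_flight}
        if host_of(first) in busy_hosts:
            return None            # adjacent: wait for the download to finish
        self.queue.popleft()
        self.in_flight.add(first)
        return first

    def finish(self, url):
        """Record that a download completed, unblocking adjacent requests."""
        self.in_flight.discard(url)
```

A caller would invoke `try_start()` in a loop and call `finish(url)` whenever a download completes. The sketch inspects only the first queued request, as the claim wording does, so a blocked head of the queue also holds back requests behind it.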
17. A computer readable storage medium having computer executable instructions for downloading a plurality of documents from a content server in response to download requests, comprising:
retrieving instructions for:
retrieving download requests for documents via an internal interface, each of said documents being identified by a different URL;
queuing download requests to retrieve documents from the content server wherein requests to the content server having different URLs are queued; and
determining if the URL of a first queued request is adjacent to the URL of a document being currently downloaded, wherein adjacent URLs identify documents located on the same content server;
if the URL of the first queued request is adjacent to the URL of a document being currently downloaded, waiting until the current document has been downloaded by the content server before initiating one of the queued requests to avoid overlapping requests to the content server; and
if the URL of the first queued request is not adjacent to the URL of a document being currently downloaded, initiating one of the queued requests;
distributing instructions for distributing the initiated download requests among a plurality of external interfaces, and wherein each external interface has a different IP address for communicating with the content server; and
downloading instructions for downloading documents corresponding to the initiated download requests from the content server via the plurality of external interfaces.
Dependent claims: 18, 19, 20
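The distributing step of claim 17 can be sketched as a plan that pairs each initiated request with one external interface. The claim does not say how requests are spread, so the round-robin rotation here is an assumption, as are the function name `distribute` and the example addresses, which come from the reserved documentation range 192.0.2.0/24.

```python
from itertools import cycle


def distribute(request_urls, interface_ips):
    """Pair each initiated request with an external interface IP.

    Assumption: plain round-robin; the claim only requires that the
    interfaces have different IP addresses.
    """
    if not interface_ips:
        raise ValueError("at least one external interface is required")
    if len(set(interface_ips)) != len(interface_ips):
        raise ValueError("each external interface needs a different IP address")

    rotation = cycle(interface_ips)
    # Each request takes the next address in the rotation.
    return [(url, next(rotation)) for url in request_urls]
```

In Python, an outgoing connection could then be tied to the chosen address with the `source_address` argument of `socket.create_connection`, provided the host actually owns that address.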
21. A system for downloading a plurality of documents from a plurality of content servers, said content servers being linked to a plurality of routers that each have a different network address, said system comprising:
a plurality of pullers;
a director for:
creating a list of document addresses corresponding to the plurality of documents to be downloaded, each of the plurality of said documents being identified by a different document address; and
assigning a portion of the list of document addresses to each of the pullers such that each portion assigned to a particular puller includes all documents to be retrieved from one content server wherein no two pullers are assigned document addresses to be retrieved from the same content server;
wherein a first puller is responsive to the director for:
receiving a first portion of the list of document addresses;
queuing a first request to retrieve a first document based on a first document address and queuing a second request to retrieve a second document based on a second document address, said first and second document addresses included in the list of document addresses;
initiating the first queued request to retrieve the first document from a first content server based on the first document address;
based on the second document address of the second queued request, determining the content server identified by the second document address;
if the determined content server is the first content server and the first content server is in the process of retrieving the first document, waiting until the first document has been received by the first content server before initiating the second queued request to avoid overlapping requests to the first content server;
if the determined content server is not the first content server or the first content server is not in the process of retrieving the first document, initiating the second queued request to retrieve the second document from the determined content server; and
a proxy gateway responsive to each of the pullers for receiving the initiated requests to retrieve documents, and for retrieving documents corresponding to the list of document addresses from the content servers via the routers.
Dependent claims: 22
Specification