Method and system for scheduling web crawlers according to keyword search
First Claim
1. A method for scheduling web crawlers according to a keyword search, characterized in comprising:
a step A of receiving a task request command;
a step B of acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating a task for crawling the secondary download link address, adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, performing a step C, and otherwise performing a step D, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages according to the task in the task list;
the step D of acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of each of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and otherwise performing a step E, wherein each keyword multipage link address from the second bucket is a link address of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
the step E of acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword;
the step C of returning the task list to a web crawler, and the web crawler performing the task in the task list according to the received task list, wherein performing the task includes crawling a page according to the task, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages;
if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
if the analysis data is the information details, placing the information details in a fourth bucket; and
if the analysis data is the quantity of search result pages, adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket, wherein adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket specifically comprises:
setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the third bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket.
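The adjustment rules that close claim 1 can be read as a small state transition over two keyed collections. The following is a minimal sketch of that reading, not the patented implementation: the class name `KeywordBuckets`, the dictionary layout, and the default of one page for an unseen link are all assumptions made for illustration. Claim 1 does not say what happens when both quantities are below the threshold, so the sketch simply records the new count in that case.

```python
from dataclasses import dataclass, field

THRESHOLD = 2  # preset threshold; the claim only requires it to be at least 2


@dataclass
class KeywordBuckets:
    """Hypothetical container: keyword link address -> last reported page count."""
    second: dict = field(default_factory=dict)  # keyword multipage link addresses
    third: dict = field(default_factory=dict)   # remaining keyword link addresses

    def adjust(self, link: str, new_pages: int) -> None:
        # Assumption: a link that was never reported counts as one result page.
        old_pages = self.second.get(link, self.third.get(link, 1))
        if old_pages == new_pages:
            return  # old and new quantities are consistent, nothing to adjust

        old_high = old_pages >= THRESHOLD
        new_high = new_pages >= THRESHOLD
        if old_high and new_high:
            # Stays in the second bucket; only the page count is modified.
            self.second[link] = new_pages
        elif old_high and not new_high:
            # Fell below the threshold: move to the third bucket.
            self.second.pop(link, None)
            self.third[link] = new_pages
        elif not old_high and new_high:
            # Rose to the threshold: move to the second bucket.
            self.third.pop(link, None)
            self.second[link] = new_pages
        else:
            # Both below the threshold: not covered by claim 1 (assumption).
            self.third[link] = new_pages
```

Keeping the page count next to each link lets the second-bucket step derive one link per result page without a further lookup.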
Abstract
A method and a system for scheduling web crawlers according to keyword search. The method comprises: a scheduling end receiving a task request command sent by a crawling node; the scheduling end acquiring a secondary download link address from a priority bucket, generating tasks, and adding the generated tasks into a task list; acquiring keyword link addresses from a dynamic bucket, deriving a derivative link address for each of the pages corresponding to the keyword link addresses, generating a task for each page according to its derivative link address, and adding those tasks into the task list; acquiring a keyword link address from a basic bucket, generating tasks, and adding the generated tasks into the task list; and the scheduling end returning the task list to the crawling node. By adjusting the quantities of tasks allowed to be added from a virtual bucket, the quantities of scheduled link addresses of different types are flexibly adjusted. In addition, crawling popular keywords more frequently prevents missed data, and repeated crawling of unpopular keywords is reduced.
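The bucket order described in the abstract, combined with the branch order worded in claim 1 (return the list as soon as a bucket reaches its allowed quantity, otherwise fall through to the next bucket), might be sketched as follows. This is one possible reading under stated assumptions: the function names, the `&page=N` URL scheme, the `quotas` dictionary, counting the quota in tasks, and rotating keyword links to the back of their bucket after use are all hypothetical and do not come from the source.

```python
from collections import deque


def derive_page_links(link, pages):
    # Hypothetical URL scheme: the page number is appended as "&page=N".
    return [f"{link}&page={n}" for n in range(1, pages + 1)]


def build_task_list(priority, dynamic, basic, quotas):
    """Fill a task list from three deques; quotas are assumed to be positive."""
    tasks = []

    # Priority bucket: secondary download links are consumed once.
    added = 0
    while priority and added < quotas["priority"]:
        tasks.append(priority.popleft())
        added += 1
    if added >= quotas["priority"]:
        return tasks  # allowed quantity reached: return the list

    # Dynamic bucket: one task per search result page of each keyword link.
    added = 0
    for _ in range(len(dynamic)):
        link, pages = dynamic[0]
        dynamic.rotate(-1)  # assumption: the keyword stays scheduled for later
        page_links = derive_page_links(link, pages)
        tasks.extend(page_links)
        added += len(page_links)
        if added >= quotas["dynamic"]:
            return tasks

    # Basic bucket: one task per keyword link.
    added = 0
    for _ in range(len(basic)):
        tasks.append(basic[0])
        basic.rotate(-1)
        added += 1
        if added >= quotas["basic"]:
            break
    return tasks
```

Raising or lowering one entry of `quotas` changes how many link addresses of that type reach the crawling node per request, which is the tuning knob the abstract refers to.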
12 Claims
6. A system for scheduling web crawlers according to a keyword search, characterized in comprising a scheduling end, and at least one web crawler that communicates with the scheduling end,
the scheduling end comprising:
a task request command receiving module for receiving a task request command sent by the web crawler;
a first bucket task generation module for acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating tasks, and adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, a task list returning module is executed, and otherwise a second bucket task generation module is executed, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of pages crawled by the web crawler according to the task in the task list;
the second bucket task generation module for acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, the task list returning module is executed, and otherwise a third bucket task generation module is executed, wherein the keyword link addresses from the second bucket are link addresses of a plurality of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
the third bucket task generation module for acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword; and
the task list returning module for returning the task list to the web crawler; and
the web crawler comprising:
the task request command sending module for sending a task request command to the scheduling end; and
a task performing module for performing at least one task in the task list according to the received task list, wherein the task performing module is specifically used for:
crawling pages according to the at least one task in the task list, analyzing the crawled pages to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages, and sending the analysis data to the scheduling end; and
the scheduling end further comprises an analysis data receiving module for receiving the analysis data, wherein the analysis data receiving module is specifically used for:
if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket,
if the analysis data is the information details, placing the information details in a fourth bucket, and
if the analysis data is the quantity of the search result pages, adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket, wherein adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket by the analysis data receiving module specifically comprises:
setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the third bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket.
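The division of work in claim 6 between the scheduling end and the web crawler can be illustrated with two small classes exchanging plain method calls. This is a rough sketch only: the class and method names, the string tags for the three kinds of analysis data, and the `pending` queue that stands in for the bucket task generation modules are hypothetical, and the bucket adjustment on a reported page count is reduced to recording the number.

```python
from collections import deque


class SchedulingEnd:
    def __init__(self):
        self.first_bucket = deque()  # secondary download link addresses
        self.fourth_bucket = []      # information details
        self.page_counts = {}        # keyword link address -> reported page count
        self.pending = deque()       # stand-in for the bucket task generation modules

    def handle_task_request(self, limit=10):
        # Task request command receiving module plus task list returning module.
        tasks = []
        while self.pending and len(tasks) < limit:
            tasks.append(self.pending.popleft())
        return tasks

    def receive_analysis_data(self, kind, payload):
        # Analysis data receiving module: route by the kind of data.
        if kind == "secondary_links":
            self.first_bucket.extend(payload)
        elif kind == "details":
            self.fourth_bucket.append(payload)
        elif kind == "page_count":
            link, pages = payload
            self.page_counts[link] = pages  # bucket moves are omitted here
        else:
            raise ValueError(f"unknown analysis data kind: {kind}")


class WebCrawler:
    def __init__(self, scheduler, crawl_and_analyze):
        self.scheduler = scheduler
        # Hypothetical callable: task -> iterable of (kind, payload) pairs.
        self.crawl_and_analyze = crawl_and_analyze

    def run_once(self):
        # Send a task request, perform each task, report the analysis data.
        tasks = self.scheduler.handle_task_request()
        for task in tasks:
            for kind, payload in self.crawl_and_analyze(task):
                self.scheduler.receive_analysis_data(kind, payload)
        return len(tasks)
```

In a deployment the two sides would presumably talk over a network; direct calls are used here only to keep the example self-contained.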
11. A method for scheduling web crawlers according to a keyword search, characterized in comprising:
a step A of receiving a task request command;
a step B of acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating a task for crawling the secondary download link address, adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, performing a step C, and otherwise performing a step D, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages according to the task in the task list;
the step D of acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of each of a plurality of pages corresponding to the keyword multipage link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages to the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and otherwise performing a step E, wherein each keyword multipage link address from the second bucket is a link address of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
the step E of acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword, characterized in that the third bucket comprises an active bucket and a suspended bucket and the step E specifically comprises:
acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list;
increasing the scheduling times for keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, otherwise, if the active bucket further stores keyword link addresses, performing the step E, and if the active bucket stores no keyword link addresses, performing the step C;
the step C of returning the task list to a web crawler, and the web crawler performing the task in the task list according to the received task list, wherein performing the task includes crawling a page according to the task, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages;
if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
if the analysis data is the information details, placing the information details in a fourth bucket; and
if the analysis data is the quantity of search result pages, setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the active bucket of the second bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, searching the suspended bucket of the second bucket and moving the keyword link address, the scheduling times for which in the suspended bucket reach a current time, into the active bucket.
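The active and suspended buckets of step E in claim 11 behave like two priority queues ordered by scheduling time. A minimal sketch under that assumption follows; the class name `ThirdBucket`, the use of `heapq`, and numeric scheduling times are illustrative choices, not details taken from the claim.

```python
import heapq


class ThirdBucket:
    def __init__(self, interval):
        self.interval = interval  # preset scheduling time increase
        self.active = []          # min-heap of (scheduling_time, link)
        self.suspended = []       # min-heap of (scheduling_time, link)

    def add(self, link, scheduling_time):
        heapq.heappush(self.active, (scheduling_time, link))

    def take(self, quota):
        """Step E: repeat until the quota is reached or the active bucket is empty."""
        tasks = []
        while self.active and len(tasks) < quota:
            # Acquire the keyword link address with the earliest scheduling time.
            when, link = heapq.heappop(self.active)
            tasks.append(link)
            # Push its scheduling time forward and park it in the suspended bucket.
            heapq.heappush(self.suspended, (when + self.interval, link))
        return tasks

    def release_due(self, now):
        """Move links whose scheduling time has reached `now` back to active."""
        moved = 0
        while self.suspended and self.suspended[0][0] <= now:
            heapq.heappush(self.active, heapq.heappop(self.suspended))
            moved += 1
        return moved
```

With heaps, both picking the earliest link and finding suspended links that are due cost logarithmic time per link, so the bucket does not need to be scanned in full on every request.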
12. A system for scheduling web crawlers according to a keyword search, characterized in comprising a scheduling end, and at least one web crawler that communicates with the scheduling end,
the scheduling end comprising:
a task request command receiving module for receiving a task request command sent by the web crawler;
a first bucket task generation module for acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating tasks, and adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, a task list returning module is executed, and otherwise a second bucket task generation module is executed, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of pages crawled by the web crawler according to the task in the task list;
the second bucket task generation module for acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, the task list returning module is executed, and otherwise a third bucket task generation module is executed, wherein the keyword link addresses from the second bucket are link addresses of a plurality of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
the third bucket task generation module for acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword; and
the task list returning module for returning the task list to the web crawler; and
the web crawler comprising:
the task request command sending module for sending a task request command to the scheduling end; and
a task performing module for performing at least one task in the task list according to the received task list,
characterized in that:
the third bucket comprises an active bucket and a suspended bucket;
the third bucket task generation module is specifically used for:
acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list, and increasing the scheduling times for the keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, otherwise, if the active bucket further stores keyword link addresses, the third bucket task generation module is executed, and if the active bucket stores no keyword link addresses, the task list returning module is executed;
the task performing module is specifically used for:
crawling a page according to the task in the task list, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages, and sending the analysis data to the scheduling end; and
the scheduling end further comprises an analysis data receiving module used for receiving the analysis data;
if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
if the analysis data is the information details, placing the information details in a fourth bucket; and
if the analysis data is the quantity of search result pages, setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of search result pages previously received for the same keyword link address as an old quantity of the search result pages;
if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the search result pages corresponding to the keyword link address to the new quantity of the search result pages;
or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the active bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket;
or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, searching the suspended bucket, and moving the keyword link address, the scheduling times for which in the suspended bucket reach a current time, into the active bucket.
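The four alternatives that close claim 12 form a two-by-two decision table on whether the old and new page quantities reach the preset threshold. As a hedged illustration, the table can be written as a pure function that only names the action to take; the `Action` enum and `choose_action` are hypothetical names, and the default threshold of 2 is simply the smallest value the claim allows.

```python
from enum import Enum


class Action(Enum):
    UPDATE_COUNT = "modify the stored page quantity"
    MOVE_TO_ACTIVE = "move the link to the active bucket"
    MOVE_TO_SECOND = "move the link to the second bucket"
    RELEASE_DUE_SUSPENDED = "move due links from suspended to active"


def choose_action(old_pages, new_pages, threshold=2):
    """Map (old, new) page quantities to one of the four claimed alternatives."""
    old_high = old_pages >= threshold
    new_high = new_pages >= threshold
    if old_high and new_high:
        return Action.UPDATE_COUNT
    if old_high:
        return Action.MOVE_TO_ACTIVE  # new quantity fell below the threshold
    if new_high:
        return Action.MOVE_TO_SECOND  # new quantity rose to the threshold
    return Action.RELEASE_DUE_SUSPENDED  # both below the threshold
```

Separating the decision from the bucket mutation keeps the table easy to check against the claim wording, while the actual moves can live with whatever data structure holds the buckets.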
Specification