Method and system for scheduling web crawlers according to keyword search

  • US 10,185,771 B2
  • Filed: 01/09/2015
  • Issued: 01/22/2019
  • Est. Priority Date: 01/09/2014
  • Status: Active Grant
First Claim

1. A method for scheduling web crawlers according to a keyword search, characterized in comprising:

  • a step A of receiving a task request command;

    a step B of acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating a task for crawling the secondary download link address, adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, performing a step C, and otherwise performing a step D, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages according to the task in the task list;

    the step D of acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of each of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and otherwise performing a step E, wherein each keyword multipage link address from the second bucket is a link address of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;

    the step E of acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword;

    the step C of returning the task list to a web crawler, and the web crawler performing the task in the task list according to the received task list, wherein performing the task includes crawling a page according to the task, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages, if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;

    if the analysis data is the information details, placing the information details in a fourth bucket; and

    if the analysis data is the quantity of search result pages, adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket, wherein adjusting the keyword link addresses corresponding to the search result pages in the second bucket and the third bucket specifically comprises:

    setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and

    if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:

    if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;

    or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the third bucket;

    or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket.
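
The order in which steps A, B, D, E and C of claim 1 fill the task list can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `KeywordLink`, `Buckets`, `build_task_list` and the `quotas` dictionary are hypothetical, a task is represented simply by the URL to crawl, derivative links are formed by substituting a page number into a `{page}` placeholder, and keyword links are assumed to stay in their bucket after being scheduled (the claim does not say whether they are removed).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KeywordLink:
    url_template: str    # hypothetical: "{page}" marks the page number
    page_count: int = 1  # last known quantity of search result pages


@dataclass
class Buckets:
    secondary: deque = field(default_factory=deque)  # first bucket: URLs
    multipage: deque = field(default_factory=deque)  # second bucket: KeywordLink
    single: deque = field(default_factory=deque)     # third bucket: KeywordLink


def build_task_list(buckets, quotas):
    """Build the task list for one task request (step A).

    quotas maps a bucket name to the number of tasks allowed from it.
    """
    tasks = []

    # Step B: one task per secondary download link (assumed consumed).
    taken = 0
    while buckets.secondary and taken < quotas["secondary"]:
        tasks.append(buckets.secondary.popleft())
        taken += 1
    if taken >= quotas["secondary"]:
        return tasks  # quota reached: go to step C

    # Step D: expand each multipage keyword link into one task per page.
    # The quota is checked between links, so the last link may overshoot it.
    taken = 0
    for _ in range(len(buckets.multipage)):
        if taken >= quotas["multipage"]:
            break
        link = buckets.multipage[0]
        buckets.multipage.rotate(-1)  # assumption: keep the link for later rounds
        pages = [link.url_template.format(page=p)
                 for p in range(1, link.page_count + 1)]
        tasks.extend(pages)
        taken += len(pages)
    if taken >= quotas["multipage"]:
        return tasks  # quota reached: go to step C

    # Step E: one task per ordinary keyword link.
    taken = 0
    for _ in range(len(buckets.single)):
        if taken >= quotas["single"]:
            break
        link = buckets.single[0]
        buckets.single.rotate(-1)  # assumption: keep the link for later rounds
        tasks.append(link.url_template.format(page=1))
        taken += 1

    # Step C: the list is returned to the crawler. The claim only names the
    # quota-reached case after step E; returning regardless is an assumption.
    return tasks
```

The early returns mirror the priority in the claim: secondary download links are served first, and a lower-priority bucket is consulted only when the bucket before it could not fill its quota.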

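The page-count adjustment at the end of claim 1 reduces to a comparison of the old and new quantities against the preset threshold. The Python sketch below shows one way to express it. The function name `adjust_keyword_link`, the `page_counts` dictionary, the use of sets for the second and third buckets, and the returned status strings are all hypothetical, and storing the new quantity in every case where it differs from the old one is an assumption (the claim only states this explicitly for links that stay in the second bucket).

```python
def adjust_keyword_link(url, new_count, page_counts,
                        second_bucket, third_bucket, threshold=2):
    """Re-bucket a keyword link after a crawl reports its page count.

    page_counts maps each keyword link to its previously received quantity
    of search result pages; the two buckets are sets of keyword links.
    The claim requires the preset threshold to be no less than 2.
    """
    old_count = page_counts[url]
    if old_count == new_count:
        return "same"  # quantities are consistent: nothing to adjust

    page_counts[url] = new_count  # assumption: always remember the latest value
    old_multi = old_count >= threshold
    new_multi = new_count >= threshold

    if old_multi and new_multi:
        # Stays in the second bucket with the modified page quantity.
        return "count_updated"
    if old_multi and not new_multi:
        second_bucket.discard(url)
        third_bucket.add(url)
        return "moved_to_third"
    if not old_multi and new_multi:
        third_bucket.discard(url)
        second_bucket.add(url)
        return "moved_to_second"
    # Both quantities are below the threshold: the claim specifies no action.
    return "no_action"
```

For example, a link previously recorded with 5 result pages that now reports 1 page would move from the second bucket to the third, and would move back once a later crawl reports a count at or above the threshold.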