Method and system for scheduling web crawlers according to keyword search

US 10,185,771 B2
Filed: 01/09/2015
Issued: 01/22/2019
Est. Priority Date: 01/09/2014
Status: Active Grant


Abstract

A method and a system for scheduling web crawlers according to keyword search. The method comprises: a scheduling end receiving a task request command sent by a crawling node; the scheduling end acquiring a secondary download link address from a priority bucket, generating tasks, and adding the generated tasks into a task list; acquiring keyword link addresses from a dynamic bucket, deriving derivative link addresses for the quantity of pages corresponding to each keyword link address, generating one task per page according to the derivative link addresses, and adding those tasks into the task list; acquiring a keyword link address from a basic bucket, generating tasks, and adding the generated tasks into the task list; and the scheduling end returning the task list to the crawling node. By adjusting the quantity of tasks allowed to be added from each virtual bucket, the quantities of scheduled link addresses of different types are flexibly adjusted. In addition, by crawling popular keywords more frequently, missed data is prevented, and repeated crawling of unpopular keywords is reduced.
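
The bucket order described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_task_list`, the `quotas` dictionary, the `&page=` URL scheme, and the choice to count quotas in tasks are all assumptions introduced here. The early return when a quota is reached follows the step B / step D wording of claim 1.

```python
from collections import deque

def build_task_list(priority, dynamic, basic, quotas):
    """Fill one task list from three buckets, each capped by its own quota.

    priority: deque of secondary download URLs
    dynamic:  list of (keyword_url, page_count) pairs
    basic:    deque of keyword URLs
    quotas:   hypothetical per-bucket caps (assumed positive), counted in tasks
    """
    tasks = []

    # Priority bucket: each secondary download link is removed once scheduled.
    added = 0
    while priority and added < quotas["priority"]:
        tasks.append(priority.popleft())
        added += 1
    if added >= quotas["priority"]:
        return tasks  # quota reached: return the list to the crawler

    # Dynamic bucket: one task per result page of each multipage keyword link.
    added = 0
    for url, pages in dynamic:
        for page in range(1, pages + 1):
            tasks.append(f"{url}&page={page}")  # assumed pagination scheme
        added += pages
        if added >= quotas["dynamic"]:
            return tasks

    # Basic bucket: remaining keyword links, up to the basic quota.
    added = 0
    while basic and added < quotas["basic"]:
        tasks.append(basic.popleft())
        added += 1
    return tasks
```

Giving the dynamic bucket a larger quota than the basic bucket is what lets popular, multipage keywords be crawled more often than unpopular ones.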

10 Citations

12 Claims

1. A method for scheduling web crawlers according to a keyword search, characterized in comprising:
- a step A of receiving a task request command;
  
  a step B of acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating a task for crawling the secondary download link address, adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, performing a step C, and otherwise performing a step D, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages according to the task in the task list;
  
  the step D of acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of each of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and otherwise performing a step E, wherein each keyword multipage link address from the second bucket is a link address of search result pages generated in a target website according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
  
  the step E of acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword;
  
  the step C of returning the task list to a web crawler, and the web crawler performing the task in the task list according to the received task list, wherein performing the task includes crawling a page according to the task, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages, if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
  
  if the analysis data is the information details, placing the information details in a fourth bucket; and
  
  if the analysis data is the quantity of search result pages, adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket, wherein adjusting the keyword link addresses corresponding to the search result pages in the second bucket and the third bucket specifically comprises:
  
  setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
  
  if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
  
  if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
  
  or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the third bucket;
  
  or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket.
- - 2. The method for scheduling web crawlers according to a keyword search according to claim 1, characterized in that the step B specifically comprises:
    - acquiring the secondary download link address from the first bucket that stores secondary download link addresses;
      
      generating tasks;
      
      adding the generated tasks into the task list;
      
      deleting the secondary download link addresses, for which the task has been generated, from the first bucket; and
      
      if the quantities allowed to be added into the task list from the first bucket are reached, performing the step C, otherwise if the first bucket further stores secondary download link addresses, performing the step B, and if all the secondary download link addresses have been deleted from the first bucket, performing the step D.
  - 3. The method for scheduling web crawlers according to a keyword search according to claim 1, characterized in that the step D specifically comprises:
    - acquiring unscheduled keyword link addresses from the second bucket that stores keyword link addresses;
      
      deriving the derivative link addresses of the plurality of pages corresponding to the keyword link addresses;
      
      generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages and adding the tasks into the task list;
      
      setting states of the keyword link addresses, for which the tasks have been generated, to scheduled; and
      
      if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and setting states of all the keyword link addresses in the second bucket to unscheduled, otherwise if the second bucket further stores unscheduled keyword link addresses, performing the step D, and if the second bucket stores no unscheduled keyword link addresses, performing the step E.
  - 4. The method for scheduling web crawlers according to a keyword search according to claim 1, characterized in that the third bucket comprises an active bucket and a suspended bucket;
    - the step E specifically comprises:
      
      acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list;
      
      increasing the scheduling times for keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
      
      if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, otherwise, if the active bucket further stores keyword link addresses, performing the step E, and if the active bucket stores no keyword link addresses, performing the step C.
  - 5. The method for scheduling web crawlers according to a keyword search according to claim 1, characterized in that the quantities allowed to be added from the second bucket are more than the quantities allowed to be added from the third bucket.
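
The page-count rule at the end of claim 1 can be illustrated with a short sketch. The names `adjust_buckets` and `page_counts`, and the use of Python sets for the buckets, are assumptions made for this example. The claim only states that the stored quantity is modified when both counts are at or above the threshold; recording the new count in every case is a simplification made here.

```python
THRESHOLD = 2  # preset threshold; the claims require it to be at least 2

def adjust_buckets(url, new_count, page_counts, second_bucket, third_bucket,
                   threshold=THRESHOLD):
    """Apply the old/new page-count comparison to one keyword link address.

    page_counts maps each keyword URL to the last reported page count.
    second_bucket and third_bucket are sets of keyword URLs.
    """
    old_count = page_counts.get(url, 0)
    if old_count == new_count:
        return  # quantities are consistent: nothing to adjust

    # Assumption: always remember the latest count for the next comparison.
    page_counts[url] = new_count

    was_multipage = old_count >= threshold
    is_multipage = new_count >= threshold
    if was_multipage and not is_multipage:
        # Dropped below the threshold: demote to the third bucket.
        second_bucket.discard(url)
        third_bucket.add(url)
    elif not was_multipage and is_multipage:
        # Rose to the threshold or above: promote to the second bucket.
        third_bucket.discard(url)
        second_bucket.add(url)
    # Both at or above the threshold: only the stored count changes.
```

A keyword whose result count shrinks to a single page therefore stops generating per-page tasks, and one that grows past the threshold starts generating them again.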

6. A system for scheduling web crawlers according to a keyword search, characterized in comprising a scheduling end, and at least one web crawler that communicates with the scheduling end, the scheduling end comprising:
- a task request command receiving module for receiving a task request command sent by the web crawler;
  
  a first bucket task generation module for acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating tasks, and adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, a task list returning module is executed, and otherwise a second bucket task generation module is executed, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages crawled by the web crawler according to the task in the task list;
  
  the second bucket task generation module for acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, the task list returning module is executed, and otherwise a third bucket task generation module is executed, wherein the keyword link addresses from the second bucket are link addresses of a plurality of search result pages generated in a target web site according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
  
  the third bucket task generation module for acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target website according to the keyword; and
  
  the task list returning module for returning the task list to the web crawler; and
  
  the web crawler comprising:
  
  the task request command sending module for sending a task request command to the scheduling end; and
  
  a task performing module for performing at least one task in the list according to the received task list, wherein the task performing module is specifically used for:
  
  crawling pages according to the at least one task in the list, analyzing the crawled pages to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages, and sending the analysis data to the scheduling end; and
  
  the scheduling end further comprises an analysis data receiving module for receiving the analysis data, wherein the analysis data receiving module is specifically used for:
  
  if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket, if the analysis data is the information details, placing the information details in a fourth bucket, and if the analysis data is the quantity of the search result pages, adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket, wherein adjusting the keyword link address corresponding to the search result pages in the second bucket and the third bucket by the analysis data receiving module specifically comprises:
  
  setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
  
  if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
  
  if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
  
  or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the third bucket;
  
  or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket.
- - 7. The system for scheduling web crawlers according to a keyword search according to claim 6, characterized in that:
    - the first bucket task generation module is specifically used for:
      
      acquiring the secondary download link address from the first bucket that stores secondary download link addresses, generating tasks, adding the generated tasks into the task list, and deleting the secondary download link addresses, for which the task has been generated, from the first bucket; and
      
      if the quantities allowed to be added into the task list from the first bucket are reached, the task list returning module is executed, otherwise if the first bucket further stores secondary download link addresses, the first bucket task generation module is executed, and if all the secondary download link addresses have been deleted from the first bucket, the second bucket task generation module is executed.
  - 8. The system for scheduling web crawlers according to a keyword search according to claim 6, characterized in that:
    - the second bucket task generation module is specifically used for:
      
      acquiring unscheduled keyword link addresses from the second bucket that stores keyword link addresses;
      
      deriving the derivative link addresses of the plurality of pages corresponding to the keyword link addresses;
      
      generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages and adding the tasks into the task list; and
      
      setting states of the keyword link addresses, for which the tasks have been generated, to scheduled; and
      
      if the quantities allowed to be added into the task list from the second bucket are reached, the task list returning module is executed, and states of all the keyword link addresses in the second bucket are set to unscheduled, otherwise if the second bucket further stores unscheduled keyword link addresses, the second bucket task generation module is executed, and if the second bucket stores no unscheduled keyword link addresses, the third bucket task generation module is executed.
  - 9. The system for scheduling web crawlers according to a keyword search according to claim 6, characterized in that:
    - the third bucket comprises an active bucket and a suspended bucket;
      
      the third bucket task generation module is specifically used for:
      
      acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list; and
      
      increasing the scheduling times for the keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
      
      if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, otherwise, if the active bucket further stores keyword link addresses, the third bucket task generation module is executed, and if the active bucket stores no keyword link addresses, the task list returning module is executed.
  - 10. The system for scheduling web crawlers according to a keyword search according to claim 6, characterized in that the quantities allowed to be added from the second bucket are more than the quantities allowed to be added from the third bucket.
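
The scheduled/unscheduled bookkeeping of the second bucket (claims 3 and 8) might look like the sketch below. The `KeywordLink` class, the `derive_page_urls` helper, and the `&page=` URL scheme are hypothetical; real target websites paginate in their own ways. The quota is counted in tasks, which is also an assumption. The claims do not say what happens to the states when no unscheduled link remains, so the sketch leaves them untouched in that case.

```python
from dataclasses import dataclass

@dataclass
class KeywordLink:
    url: str
    pages: int               # last known quantity of search result pages
    scheduled: bool = False  # True once tasks were generated for this link

def derive_page_urls(link):
    """Derive one URL per result page (assumed pagination scheme)."""
    return [f"{link.url}&page={n}" for n in range(1, link.pages + 1)]

def schedule_second_bucket(bucket, quota):
    """Return (tasks, quota_reached) for one pass over the second bucket."""
    tasks = []
    for link in bucket:
        if link.scheduled:
            continue  # only unscheduled keyword links are acquired
        tasks.extend(derive_page_urls(link))
        link.scheduled = True
        if len(tasks) >= quota:
            # Quota reached: reset all states to unscheduled, as claimed.
            for other in bucket:
                other.scheduled = False
            return tasks, True
    # No unscheduled link left: the caller moves on to the third bucket.
    return tasks, False
```

A caller would return the task list to the crawler when `quota_reached` is true and otherwise continue with the third bucket task generation.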

11. A method for scheduling web crawlers according to a keyword search, characterized in comprising:
- a step A of receiving a task request command;
  
  a step B of acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating a task for crawling the secondary download link address, adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, performing a step C, and otherwise performing a step D, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages according to the task in the task list;
  
  the step D of acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of each of a plurality of pages corresponding to the keyword multipage link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages to the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, performing the step C, and otherwise performing a step E, wherein each keyword multipage link address from the second bucket is a link address of search result pages generated in a target web site according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
  
  the step E of acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target web site according to the keyword, characterized in that the third bucket comprises an active bucket and a suspended bucket and the step E specifically comprises:
  
  acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list;
  
  increasing the scheduling times for keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
  
  if the quantities allowed to be added into the task list from the third bucket are reached, performing the step C, otherwise, if the active bucket further stores keyword link addresses, performing the step E, and if the active bucket stores no keyword link addresses, performing the step C;
  
  the step C of returning the task list to a web crawler, and the web crawler performing the task in the task list according to the received task list, wherein performing the task includes crawling a page according to the task, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages;
  
  if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
  
  if the analysis data is the information details, placing the information details in a fourth bucket; and
  
  if the analysis data is the quantity of search result pages, setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of the search result pages previously received for the same keyword link address as an old quantity of the search result pages; and
  
  if the old quantity of the search result pages is not consistent with the new quantity of the search result pages:
  
  if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the pages corresponding to the keyword link address to the new quantity of the search result pages;
  
  or if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the active bucket of the second bucket;
  
  or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket;
  
  or if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, searching the suspended bucket of the second bucket and moving the keyword link address, the scheduling times for which in the suspended bucket reach a current time, into the active bucket.
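
The active and suspended buckets of claim 11 can be sketched with a heap ordered by scheduling time. The function names, the `(scheduling_time, url)` tuple layout, and the numeric time values are assumptions made for illustration; the claim does not prescribe a data structure.

```python
import heapq

def schedule_third_bucket(active, suspended, quota, delay):
    """Schedule the earliest links from the active bucket up to the quota.

    active:    heap of (scheduling_time, url) tuples
    suspended: list that receives (next_time, url) after scheduling
    delay:     the preset scheduling time increase
    """
    tasks = []
    while active and len(tasks) < quota:
        when, url = heapq.heappop(active)  # earliest scheduling time first
        tasks.append(url)
        # Push the next scheduling time back and park the link.
        suspended.append((when + delay, url))
    return tasks

def wake_due_links(active, suspended, now):
    """Move links whose scheduling time has been reached back to active."""
    still_waiting = []
    for when, url in suspended:
        if when <= now:
            heapq.heappush(active, (when, url))
        else:
            still_waiting.append((when, url))
    suspended[:] = still_waiting  # update the caller's list in place
```

In the claim, the wake-up search runs when a page count is reported with both the old and new quantities below the threshold; here `wake_due_links` is simply a function the caller would invoke at that point.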

12. A system for scheduling web crawlers according to a keyword search, characterized in comprising a scheduling end, and at least one web crawler that communicates with the scheduling end, the scheduling end comprising:
- a task request command receiving module for receiving a task request command sent by the web crawler;
  
  a first bucket task generation module for acquiring a secondary download link address from a first bucket that stores secondary download link addresses, generating tasks, and adding the generated tasks into a task list, and if the quantities allowed to be added into the task list from the first bucket are reached, a task list returning module is executed, and otherwise a second bucket task generation module is executed, wherein the secondary download link addresses from the first bucket are link addresses that need secondary download acquired from analysis of crawled pages crawled by the web crawler according to the task in the task list;
  
  the second bucket task generation module for acquiring a keyword link address from a second bucket that stores keyword multipage link addresses, deriving derivative link addresses of a plurality of pages corresponding to the keyword link address, generating tasks of the plurality of pages according to the derivative link addresses of the plurality of pages, adding the tasks of the plurality of pages into the task list, and if the quantities allowed to be added into the task list from the second bucket are reached, the task list returning module is executed, and otherwise a third bucket task generation module is executed, wherein the keyword link addresses from the second bucket are link addresses of a plurality of search result pages generated in a target web site according to the keyword, wherein a quantity of search result pages for each link address stored in the second bucket is no less than a preset threshold that is no less than 2;
  
  the third bucket task generation module for acquiring a keyword link address from a third bucket that stores keyword link addresses, generating tasks, adding the generated tasks into the task list, and if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, wherein the keyword link addresses from the third bucket are link addresses of search result pages generated in a target web site according to the keyword; and
  
  the task list returning module for returning the task list to the web crawler; and
  
  the web crawler comprising:
  
  the task request command sending module for sending a task request command to the scheduling end; and
  
  a task performing module for performing at least one task in the task list according to the received task list, characterized in that:
  
  the third bucket comprises an active bucket and a suspended bucket;
  
  the third bucket task generation module is specifically used for:
  
  acquiring a keyword link address with the earliest scheduling time from the active bucket that stores the keyword link addresses, generating tasks, and adding the generated tasks into the task list, and increasing the scheduling times for the keyword link addresses, for which the tasks have been generated, by a preset scheduling time increase and then moving them to the suspended bucket; and
  
  if the quantities allowed to be added into the task list from the third bucket are reached, the task list returning module is executed, otherwise, if the active bucket further stores keyword link addresses, the third bucket task generation module is executed, and if the active bucket stores no keyword link addresses, the task list returning module is executed, the task performing module is specifically used for:
  
  crawling a page according to the task in the task list, analyzing the page to acquire analysis data including the secondary download link addresses, information details, or a quantity of search result pages and sending the analysis data to the scheduling end; and
  
  the scheduling end further comprises an analysis data receiving module used for receiving the analysis data;
  
  if the analysis data is the secondary download link addresses, placing the secondary download link addresses in the first bucket;
  
  if the analysis data is the information details, placing the information details in a fourth bucket; and
  
  if the analysis data is the quantity of search result pages, setting the quantity of the search result pages received in the analysis data as a new quantity of the search result pages, and setting the quantity of search result pages previously received for the same keyword link address as an old quantity of the search result pages;
  
  if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, modifying the corresponding quantity of the search result pages corresponding to the keyword link address to the new quantity of the search result pages;
  
  or
  
  if the old quantity of the search result pages is no less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, moving the corresponding keyword link address to the active bucket;
  
  or
  
  if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is no less than the preset threshold, moving the corresponding keyword link address to the second bucket;
  
  or
  
  if the old quantity of the search result pages is less than the preset threshold and the new quantity of the search result pages is less than the preset threshold, searching the suspended bucket, and moving the keyword link address, the scheduling times for which in the suspended bucket reach a current time, into the active bucket.
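
How the analysis data receiving module routes each result could be sketched as below. The dictionary layout with a `kind` field, the key names, and the `on_page_count` callback are all hypothetical; the claims only name the three kinds of analysis data and where each one goes.

```python
def receive_analysis_data(data, first_bucket, fourth_bucket, on_page_count):
    """Route one analysis result from a crawler by its kind.

    data:          hypothetical dict, e.g. {"kind": "details", "record": ...}
    first_bucket:  list of secondary download link addresses
    fourth_bucket: list of extracted information details
    on_page_count: callback taking (keyword_url, count) that performs the
                   second/third bucket adjustment
    """
    kind = data["kind"]
    if kind == "secondary_links":
        # Links needing a secondary download go back to the first bucket.
        first_bucket.extend(data["urls"])
    elif kind == "details":
        # Final information details are stored in the fourth bucket.
        fourth_bucket.append(data["record"])
    elif kind == "page_count":
        # A reported quantity of result pages triggers bucket adjustment.
        on_page_count(data["keyword_url"], data["count"])
    else:
        raise ValueError(f"unknown analysis data kind: {kind!r}")
```

Passing the adjustment logic in as a callback keeps the routing separate from the threshold comparison, which is one possible design rather than something the claims require.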

Current Assignee: Beijing Jingdong Century Trading Co., Ltd., Beijing Jingdong Shangke Information Technology Co., Ltd. (JD.com, Inc.)
Original Assignee: Beijing Jingdong Century Trading Co., Ltd., Beijing Jingdong Shangke Information Technology Co., Ltd. (JD.com, Inc.)
Inventors: Liao, Yaohua; Li, Xiaowei
Primary Examiner(s): Ly, Anh
Application Number: US15/110,564
Publication Number: US 20160328475A1
Time in Patent Office: 1,474 Days

CPC Class Codes

G06F 16/162   Delete operations erasing i...

G06F 16/951   Indexing; Web crawling tech...

G06F 16/958   Organisation or management ...

G06F 9/4881   Scheduling strategies for d...
