System and method for prioritizing websites during a webcrawling process
Abstract
A system and method for prioritizing a fetch order of web pages. The method comprises extracting, by a web crawler, a set of candidate web pages to be crawled. Each web page in the set of candidate web pages is associated with a website in a computer network. A determination is made as to whether a first website score for the website is in a website score database. The first website score is associated with web pages in the set of candidate web pages if the first website score exists in the website score database. The set of candidate web pages is prioritized with respect to an associated website score for each web page in the set of candidate web pages. Content is retrieved from the set of candidate web pages. Hyperlinks are extracted from the content. The hyperlinks are stored in a memory unit.
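The association and prioritization steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `WEBSITE_SCORES`, `DEFAULT_SCORE`, `website_of`, and `prioritize` are hypothetical, a website is approximated by the URL's host name, the website score database is modeled as an in-memory dictionary, and pages whose website has no stored score are assumed to sort last.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the website score database: host name -> score.
WEBSITE_SCORES = {"example.com": 0.9, "example.org": 0.4}

# Assumption: pages from websites without a stored score sort last.
DEFAULT_SCORE = 0.0


def website_of(url):
    """Associate a page URL with its website, approximated by the host name."""
    return (urlsplit(url).hostname or "").lower()


def prioritize(candidate_urls, scores, default=DEFAULT_SCORE):
    """Return the candidate URLs ordered by descending website score."""
    scored = [(scores.get(website_of(u), default), u) for u in candidate_urls]
    # sorted() is stable, so pages with equal scores keep their input order.
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [u for _, u in ordered]
```

Calling `prioritize(urls, WEBSITE_SCORES)` on a list of candidate URLs yields the fetch order; a real crawler would likely replace the dictionary with a persistent store and the full sort with a priority queue.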
20 Claims
1. A prioritization method, comprising:
extracting, by a web crawler in a computing system, a set of candidate web pages to be crawled, wherein said computing system comprises a memory unit, and wherein said memory unit comprises said web crawler, said set of candidate web pages, an online analysis software application, an offline analysis software application, and a website score database;
associating, by said online analysis software application, each web page in said set of candidate web pages with a website in a computer network;
determining online, by said online analysis software application, if a first website score for said website, is in said website score database;
associating, by said online analysis software application, said first website score for said website with associated web pages in said set of candidate web pages, if said first website score exists in said website score database;
prioritizing, said set of candidate web pages with respect to an associated website score for each web page in said candidate set of web pages;
retrieving, by said web crawler, content from said set of candidate web pages using said prioritizing;
extracting, by said online analysis software application, hyperlinks from said content;
storing said hyperlinks in said memory unit.
(Dependent claims: 2, 3, 4, 5, 6, 7)
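The last three steps of claim 1 (retrieving content in priority order, extracting hyperlinks, and storing them) might look like the sketch below. It is illustrative only: `LinkExtractor`, `crawl_in_order`, and the `fetch` callable are hypothetical names, the memory unit is modeled as a plain list, and `fetch` is passed in so the example needs no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute href targets of <a> tags found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def crawl_in_order(prioritized_urls, fetch):
    """Fetch pages in the given priority order and store their hyperlinks.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    stored_links = []  # stands in for the memory unit of the claim
    for url in prioritized_urls:
        parser = LinkExtractor(url)
        parser.feed(fetch(url))
        stored_links.extend(parser.links)
    return stored_links
```

The stored hyperlinks would then feed the next round of candidate extraction; deduplication and politeness delays are deliberately left out of this sketch.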
8. A computing system comprising a processor coupled to a computer-readable memory unit, said memory unit comprising a web crawler, a set of candidate web pages, an online analysis software application, an offline analysis software application, a website score database, and instructions that when executed by the processor implement a prioritization method, said method comprising:
extracting, by said web crawler, said set of candidate web pages to be crawled;
associating, by said online analysis software application, each web page in said set of candidate web pages with a website in a computer network;
determining online, by said online analysis software application, if a first website score for said website, is in said website score database;
associating, by said online analysis software application, said first website score for said website with associated web pages in said set of candidate web pages, if said first website score exists in said website score database;
prioritizing, said set of candidate web pages with respect to an associated website score for each web page in said candidate set of web pages;
retrieving, by said web crawler, content from said set of candidate web pages using said prioritizing;
extracting, by said online analysis software application, hyperlinks from said content;
storing said hyperlinks in said memory unit.
(Dependent claims: 9, 10, 11, 12, 13, 14)
15. A computer program product, comprising a computer usable medium including an online analysis software application, an offline analysis software application, a website score database, a web crawler, a set of candidate web pages, and computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement a prioritization method within a computing system, said method comprising:
extracting, by said web crawler, said set of candidate web pages to be crawled;
associating, by said online analysis software application, each web page in said set of candidate web pages with a website in a computer network;
determining online, by said online analysis software application, if a first website score for said website, is in said website score database;
associating, by said online analysis software application, said first website score for said website with associated web pages in said set of candidate web pages, if said first website score exists in said website score database;
prioritizing, said set of candidate web pages with respect to an associated website score for each web page in said candidate set of web pages;
retrieving, by said web crawler, content from said set of candidate web pages using said prioritizing;
extracting, by said online analysis software application, hyperlinks from said content;
storing said hyperlinks in said memory unit.
(Dependent claims: 16, 17, 18, 19, 20)
Specification