System, method and computer readable medium for web crawling

US 9,940,391 B2
Filed: 11/02/2011
Issued: 04/10/2018
Est. Priority Date: 05/05/2009
Status: Active Grant

Abstract

In a web crawler, a URL selection module selects URLs for pages to be downloaded. The URL selection module accesses an interaction data store that stores interaction data for web pages, including interaction data that indicates human interactions with the pages. To reduce the effects of link farms, the URL selection module filters the URLs to select only those URLs that have human interaction histories and provides the selected URLs to a download module for web page downloading.
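
The filtering step in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `InteractionEvent` record, the `human` flag, and the `inlinks` mapping are hypothetical names, and the sketch assumes that a candidate URL "has a human interaction history" when at least one page linking to it has recorded human events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionEvent:
    page_url: str  # page on which the event was recorded
    kind: str      # e.g. "click" or "linger" (illustrative values)
    human: bool    # True if the event was attributed to a person (assumed field)


def filter_by_human_interaction(candidates, inlinks, events):
    """Keep candidates linked from at least one page with human events.

    candidates: iterable of URLs not yet downloaded
    inlinks:    dict mapping a candidate URL to the pages that link to it
    events:     iterable of InteractionEvent records
    """
    pages_with_humans = {e.page_url for e in events if e.human}
    return [
        url
        for url in candidates
        if any(src in pages_with_humans for src in inlinks.get(url, ()))
    ]
```

Under this assumption, a URL reachable only from pages that no person has interacted with, as would be typical of a link farm, never reaches the download module.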

20 Claims

1. A method of selecting a uniform resource locator (URL) in a web crawling procedure, the method comprising:
- querying a URL data store for URLs corresponding to webpages which have not been previously downloaded by a web crawler computing device during a web crawling procedure;
  
  retrieving from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded by the web crawler computing device during the web crawling procedure;
  
  for each particular URL in the plurality of retrieved URLs:
  
  (a) identifying one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
  
  (b) retrieving, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
  
  selecting a highest ranked URL from the plurality of retrieved URLs, based on the user interaction event data retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
  
  (a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
  
  (b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
  
  submitting the selected URL to a download module;
  
  downloading, by the download module, a webpage corresponding to the selected URL;
  
  processing the downloaded webpage to extract any URLs within the downloaded webpage; and
  
  adding the extracted URLs to the URL data store.
- Dependent claims (2, 3, 4, 5, 6, 7):
- - 2. The method of claim 1, wherein the selection of the highest ranked URL is based on retrieving at least one predetermined human interaction behavior.
  - 3. The method of claim 2, wherein the at least one predetermined human interaction behavior comprises a content of interest ranking within the web pages that include links to the URL.
  - 4. The method of claim 1, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on at least one user behavior action out-click of web pages that include links to the URL.
  - 5. The method of claim 1, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on a location of the URL within known human attentive areas of a web page that includes a link to the URL.
  - 6. The method of claim 1, further comprising:
    - querying a content interest data store to determine whether the downloaded web page comprises content interest data.
  - 7. The method of claim 6, further comprising:
    - ranking one or more page elements of the downloaded web page according to their respective content interest score, in response to determining that the downloaded web page includes content interest data.
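
The sequence of steps recited in claim 1 can be sketched as a single crawl iteration. This is a simplified illustration under stated assumptions: `UrlStore`, `score`, `crawl_step`, and the `fetch` callback are hypothetical names, the interaction data store is modeled as a plain dictionary from page URL to a list of event kinds, and a simple event count stands in for the source element ranking and attention shift analysis, which the claim does not spell out in algorithmic detail.

```python
import re
from collections import defaultdict

VALID_KINDS = {"click", "hint", "linger"}  # interaction kinds named in claim 1
HREF = re.compile(r'href="([^"]+)"')       # toy link extractor, not an HTML parser


class UrlStore:
    """In-memory stand-in for the URL data store (assumed structure)."""

    def __init__(self):
        self.known = set()
        self.downloaded = set()
        self.inlinks = defaultdict(set)  # target URL -> downloaded pages linking to it

    def pending(self):
        """URLs that have not yet been downloaded, in a stable order."""
        return sorted(self.known - self.downloaded)

    def record(self, source, targets):
        """Mark source as downloaded and add the URLs extracted from it."""
        self.known.add(source)
        self.downloaded.add(source)
        for target in targets:
            self.known.add(target)
            self.inlinks[target].add(source)


def score(url, store, interactions):
    """Count qualifying events on the downloaded pages that link to url."""
    total = 0
    for source in store.inlinks.get(url, ()):
        hits = [k for k in interactions.get(source, ()) if k in VALID_KINDS]
        if hits:  # a linking page counts as valid only if it saw interactions
            total += len(hits)
    return total


def crawl_step(store, interactions, fetch):
    """Select the highest ranked pending URL, download it, store its links."""
    pending = store.pending()
    if not pending:
        return None
    best = max(pending, key=lambda u: score(u, store, interactions))
    html = fetch(best)                       # download module (caller supplied)
    store.record(best, HREF.findall(html))   # extract URLs and add them to the store
    return best
```

Calling `crawl_step` repeatedly until it returns `None` would drain the store; ties between equally scored URLs are broken here by sort order, which is a choice of this sketch rather than anything the claim requires.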

8. An apparatus configured to select a uniform resource locator (URL) in a web crawling procedure, the apparatus comprising:
- one or more processors; and
  
  memory storing computer-readable instructions that, when executed by the one or more processors, cause the apparatus to:
  
  query a URL data store for URLs corresponding to webpages which have not been previously downloaded during a web crawling procedure;
  
  retrieve from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded during the web crawling procedure;
  
  for each particular URL in the plurality of retrieved URLs:
  
  (a) identify one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
  
  (b) retrieve, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
  
  select a highest ranked URL from the plurality of retrieved URLs based on the user interaction event data retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
  
  (a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
  
  (b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
  
  submit the selected URL to a download module;
  
  receive the selected URL;
  
  download a webpage corresponding to the selected URL;
  
  process the downloaded webpage to extract URLs within the downloaded webpage; and
  
  add the extracted URLs to the URL data store.
- Dependent claims (9, 10, 11, 12, 13, 14):
- - 9. The apparatus of claim 8, wherein the selection of the highest ranked URL is based on retrieving at least one predetermined human interaction behavior.
  - 10. The apparatus of claim 9, wherein the at least one predetermined human interaction behavior comprises a content of interest ranking within the web pages that include links to the URL.
  - 11. The apparatus of claim 8, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on at least one user behavior action out-click of web pages that include links to the URL.
  - 12. The apparatus of claim 8, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on a location of the URL within known human attentive areas of a web page that includes a link to the URL.
  - 13. The apparatus of claim 8, the memory storing further computer-readable instructions that when executed by the one or more processors, cause the apparatus to:
    - query a content interest data store to determine whether the downloaded web page comprises content interest data.
  - 14. The apparatus of claim 13, the memory storing further computer-readable instructions that when executed by the one or more processors, cause the apparatus to:
    - rank one or more page elements of the downloaded web page according to their respective content interest score, in response to determining that the downloaded web page includes content interest data.
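
The ranking criteria of claims 11 and 12 (out-clicks from linking pages, and placement of the link within known human attentive areas) could be combined along these lines. The `LinkObservation` record, the additive scoring formula, and the `click_weight` and `area_bonus` parameters are all assumptions made for illustration; the claims do not specify how the signals are weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkObservation:
    target_url: str          # uncrawled URL that the link points to
    out_clicks: int          # recorded clicks that followed this link
    in_attentive_area: bool  # link lies in a region users are known to attend to


def rank_urls(observations, click_weight=1.0, area_bonus=5.0):
    """Return target URLs ordered from highest to lowest combined score.

    Each observation describes one link on one already-downloaded page;
    scores for the same target are summed across linking pages.
    """
    scores = {}
    for ob in observations:
        s = click_weight * ob.out_clicks
        if ob.in_attentive_area:
            s += area_bonus
        scores[ob.target_url] = scores.get(ob.target_url, 0.0) + s
    # Sort by descending score, then by URL so ties resolve deterministically.
    return sorted(scores, key=lambda u: (-scores[u], u))
```

The first element of the returned list would play the role of the highest ranked URL that is submitted to the download module.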

15. A non-transitory computer readable storage medium configured to store instructions that when executed cause a processor to perform selecting a uniform resource locator (URL) in a web crawling procedure, the processor being further configured to perform:
- querying a URL data store for URLs corresponding to webpages which have not been previously downloaded by a web crawler computing device during a web crawling procedure;
  
  retrieving from the URL data store, in response to the querying, a plurality of URLs corresponding to webpages which have not been previously downloaded by the web crawler computing device during the web crawling procedure;
  
  for each particular URL in the plurality of retrieved URLs:
  
  (a) identifying one or more additional URLs within the URL data store corresponding to webpages that have been downloaded during the web crawling procedure, and that contain links to the particular URL; and
  
  (b) retrieving, from an interaction data store, user interaction event data associated with the identified one or more additional URLs that include links to the particular URL;
  
  selecting a highest ranked URL from the plurality of retrieved URLs based on the user interaction event data records retrieved from the interaction data store associated with the plurality of retrieved URLs, wherein the selection of a particular URL as the highest ranked URL comprises performing a source element ranking and attention shift analysis for one or more web pages that include links to the particular URL, wherein performing the source element ranking and attention shift analysis comprises:
  
  (a) detecting user interactions with content areas of each of the one or more web pages that include links to the particular URL, wherein the user interactions detected comprise one or more of clicks, hints, or lingers on the content areas of the web pages; and
  
  (b) determining, based on the detected user interactions with the content areas, that the one or more web pages that include links to the particular URL are valid web pages;
  
  submitting the selected URL to a download module;
  
  downloading, by the download module, a webpage corresponding to the selected URL;
  
  processing the downloaded webpage to extract any URLs within the downloaded webpage; and
  
  adding the extracted URLs to the URL data store.
- Dependent claims (16, 17, 18, 19, 20):
- - 16. The non-transitory computer readable storage medium of claim 15, wherein the selection of the highest ranked URL is based on retrieving at least one predetermined human interaction behavior.
  - 17. The non-transitory computer readable storage medium of claim 16, wherein the at least one predetermined human interaction behavior comprises a content of interest ranking within the web pages that include links to the URL.
  - 18. The non-transitory computer readable storage medium of claim 15, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on at least one user behavior action out-click of web pages that include links to the URL.
  - 19. The non-transitory computer readable storage medium of claim 15, wherein selecting the highest ranked URL comprises:
    - ranking the plurality of URLs based on a location of the URL within known human attentive areas of a web page that includes a link to the URL.
  - 20. The non-transitory computer readable storage medium of claim 15, wherein the processor is further configured to perform:
    - querying a content interest data store to determine whether the downloaded web page comprises content interest data.
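
The content interest query of claims 6, 13, and 20, together with the element ranking of claims 7 and 14, might look like the following sketch. It assumes the content interest data store can be modeled as a dictionary mapping a page URL to per-element scores; the function name and the element identifiers are hypothetical.

```python
def rank_page_elements(page_url, content_interest_store):
    """Rank a downloaded page's elements by content interest score.

    content_interest_store: dict mapping page URL -> {element id: score}.
    Returns element ids from highest to lowest score, or an empty list
    when the store holds no content interest data for the page.
    """
    scores = content_interest_store.get(page_url)
    if not scores:
        return []  # the query found no content interest data for this page
    # Descending score; element id breaks ties deterministically.
    return sorted(scores, key=lambda element: (-scores[element], element))
```

An empty result corresponds to the case where the query determines that the downloaded page has no content interest data, so no ranking is performed.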

Current Assignee: Oracle America, Inc. (Oracle Corporation)
Original Assignee: Oracle America, Inc. (Oracle Corporation)
Inventors: Hauser, Robert R
Primary Examiner(s): GOFMAN, ALEX N
Application Number: US13/287,535
Publication Number: US 20120047122A1
Time in Patent Office: 2,351 Days
Field of Search: 707726, 707723
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
