Method and system for preventing web crawling detection

US 7,953,868 B2
Filed: 01/31/2007
Issued: 05/31/2011
Est. Priority Date: 01/31/2007
Status: Active Grant

Abstract

A method and system for preventing a detection of web crawling. A randomizing HTTP proxy server receives a first request from a web crawler to scan a website and forwards the first request to a randomly selected first proxy computer. The first proxy computer utilizes a first network address translation (NAT)-enabled router to forward the first request to the website. A NAT algorithm associates a first source Internet Protocol (IP) address with the first request. The randomizing HTTP proxy server receives a second web crawler-initiated request to scan the website and forwards the second request to a randomly selected second proxy computer. The second proxy computer utilizes a second NAT-enabled router to forward the second request to the website. The NAT algorithm associates a second source IP address with the second request. The web server identifies the first and second source IP addresses as being different.
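
The flow in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the patented implementation: the class name `RandomizingProxy`, the unit names, and the documentation-range IP addresses are all hypothetical, no network traffic is generated, and excluding the previously used unit is one simple way to make the second unit differ from the first.

```python
import random


class RandomizingProxy:
    """Toy model of a randomizing HTTP proxy server (hypothetical names)."""

    def __init__(self, proxy_units, rng=None):
        # proxy_units: assumed mapping of proxy-unit name -> the source IP
        # that the unit's NAT-enabled router would present to the website.
        self.proxy_units = dict(proxy_units)
        self.rng = rng or random.Random()
        self.last_unit = None

    def forward(self, request):
        # Choose a proxy unit at random, skipping the unit used for the
        # previous request so consecutive requests use different units.
        candidates = [u for u in self.proxy_units if u != self.last_unit]
        unit = self.rng.choice(candidates)
        self.last_unit = unit
        return {
            "request": request,
            "unit": unit,
            "source_ip": self.proxy_units[unit],
        }


proxy = RandomizingProxy(
    {
        "proxy-1": "192.0.2.10",
        "proxy-2": "198.51.100.20",
        "proxy-3": "203.0.113.30",
    },
    rng=random.Random(0),
)
first = proxy.forward("GET /page1")
second = proxy.forward("GET /page2")
# The two requests leave through different units, so their source IPs differ.
print(first["source_ip"] != second["source_ip"])  # prints True
```

Because every unit in this sketch has a distinct address, any two consecutive requests appear to the website as coming from different sources.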

17 Citations

22 Claims

1. A computer-implemented method of preventing a detection of web crawling, comprising:
- receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module and included in a computer system, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
  
  forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
  
  said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
  
  said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
  
  randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
  
  receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
  
  forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
  
  said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
  
  said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
  
  a central processing unit (CPU) of said computer system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, further comprising:
    - establishing a first Transfer Control Protocol (TCP) connection between said web crawling module and said randomizing HTTP proxy server;
      
      randomly selecting, by said randomizing HTTP proxy server and in response to said establishing said first TCP connection, said first HTTP proxy computing unit; and
      
      establishing, subsequent to said randomly selecting said first HTTP proxy computing unit, a second TCP connection between said randomizing HTTP proxy server and said first HTTP proxy computing unit.
  - 3. The method of claim 1, further comprising establishing, subsequent to said randomly selecting said second HTTP proxy computing unit, a TCP connection between said randomizing HTTP proxy server and said second HTTP proxy computing unit.
  - 4. The method of claim 1, further comprising:
    - receiving sets of publicly routable IP addresses, each set of publicly routable IP addresses received from a corresponding Internet service provider (ISP) of a plurality of ISPs;
      
      statically assigning non-routable IP addresses of a plurality of non-routable IP addresses in a specified range to corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, wherein said statically assigning non-routable IP addresses includes assigning a first non-routable IP address of said plurality of non-routable IP addresses to said first HTTP proxy computing unit and assigning a second non-routable IP address of said plurality of non-routable IP addresses to said second HTTP proxy computing unit;
      
      generating, within each HTTP proxy computing unit of said plurality of HTTP proxy computing units, a corresponding static routing table of a plurality of static routing tables, said generating including dividing an entire IP address space from IP address 1.0.0.0 to IP address 223.255.255.254 into L segments associated with corresponding routers of said plurality of routers;
      
      generating said first routing table as a static routing table corresponding to said first HTTP proxy computing unit; and
      
      generating said second routing table as a static routing table corresponding to said second HTTP proxy computing unit, wherein L is greater than M, wherein M is a total number of publicly routable IP addresses received from each ISP, and wherein M is also a number of HTTP proxy computing units included in said plurality of HTTP proxy computing units.
  - 5. The method of claim 4, further comprising:
    - mapping non-routable IP addresses of said plurality of non-routable IP addresses to publicly routable IP addresses in a first set of said sets of publicly routable IP addresses; and
      
      mapping said non-routable IP addresses to publicly routable IP addresses in a second set of said sets of publicly routable IP addresses, wherein said first source IP address is a publicly routable IP address in said first set and is mapped to said first non-routable IP address assigned to said first HTTP proxy computing unit, wherein said second source IP address is a publicly routable IP address in said second set and is mapped to said second non-routable IP address assigned to said second HTTP proxy computing unit, wherein said mapping said non-routable IP addresses to said publicly routable IP addresses in said first set includes using said first instance of said NAT algorithm, and wherein said mapping said non-routable IP addresses to said publicly routable IP addresses in said second set includes using said second instance of said NAT algorithm.
  - 6. The method of claim 4, further comprising:
    - identifying a first segment of said L segments based on said destination IP address being included in said first segment, wherein said selecting said first router is further based on said first routing table associating said first segment with said first router.
  - 7. The method of claim 1, further comprising:
    - receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said first request, a first response to said first request via a communication between said first HTTP proxy computing unit and said web server;
      
      sending, by said randomizing HTTP proxy server and in response to said receiving said first response, said first response to said web crawling module;
      
      receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said second request, a second response to said second request via a communication between said second HTTP proxy computing unit and said web server; and
      
      sending, by said randomizing HTTP proxy server and in response to said receiving said second response, said second response to said web crawling module.
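
Claims 4 and 6 describe dividing the address space from 1.0.0.0 to 223.255.255.254 into L segments and choosing a router by the segment that contains the destination IP address. The following sketch shows one possible reading of that lookup. The equal-sized segmentation, the function names, and the router names are assumptions made for illustration; the claims do not state how segment boundaries are chosen.

```python
import ipaddress
from bisect import bisect_right

# Bounds of the address space named in claim 4, as integers.
SPACE_START = int(ipaddress.IPv4Address("1.0.0.0"))
SPACE_END = int(ipaddress.IPv4Address("223.255.255.254"))


def build_routing_table(routers_per_segment):
    """Split the address space into len(routers_per_segment) segments.

    Assumption: segments are contiguous and roughly equal in size.
    Returns (starts, routers), where starts[i] is the first address
    (as an integer) of segment i and routers[i] is its router.
    """
    count = len(routers_per_segment)
    total = SPACE_END - SPACE_START + 1
    starts = [SPACE_START + (i * total) // count for i in range(count)]
    return starts, list(routers_per_segment)


def select_router(table, destination_ip):
    """Return the router associated with the segment holding destination_ip."""
    starts, routers = table
    addr = int(ipaddress.IPv4Address(destination_ip))
    if not SPACE_START <= addr <= SPACE_END:
        raise ValueError("destination outside the segmented address space")
    # Index of the last segment whose start is <= addr.
    return routers[bisect_right(starts, addr) - 1]


# Two hypothetical static routing tables, one per proxy unit, each with
# L = 4 segments that map the same segments to different routers.
first_table = build_routing_table(["router-1", "router-2", "router-3", "router-1"])
second_table = build_routing_table(["router-2", "router-3", "router-1", "router-2"])

destination = "203.0.113.7"
print(select_router(first_table, destination))
print(select_router(second_table, destination))
```

Since the two tables associate the segment containing the destination with different routers, the same target website is reached through a different router depending on which proxy unit handles the request.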

8. A computer program product for preventing a detection of web crawling in a computing environment, said computer program product comprising a computer-readable, tangible storage device having a computer-readable program code stored therein, said computer-readable program code containing instructions that are carried out by a processor of a computer system, said computer-readable program code comprising:
- computer-readable program code for receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and by determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
  
  computer-readable program code for forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
  
  said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
  
  said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
  
  computer-readable program code for randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
  
  computer-readable program code for receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
  
  computer-readable program code for forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
  
  said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
  
  said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
  
  computer-readable program code for preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The program product of claim 8, wherein said computer-readable program code further comprises:
    - computer-readable program code for establishing a first Transfer Control Protocol (TCP) connection between said web crawling module and said randomizing HTTP proxy server;
      
      computer-readable program code for randomly selecting, by said randomizing HTTP proxy server and in response to said establishing said first TCP connection, said first HTTP proxy computing unit; and
      
      computer-readable program code for establishing, subsequent to said randomly selecting said first HTTP proxy computing unit, a second TCP connection between said randomizing HTTP proxy server and said first HTTP proxy computing unit.
  - 10. The program product of claim 8, wherein said computer-readable program code further comprises computer-readable program code for establishing, subsequent to said randomly selecting said second HTTP proxy computing unit, a TCP connection between said randomizing HTTP proxy server and said second HTTP proxy computing unit.
  - 11. The program product of claim 8, wherein said computer-readable program code further comprises:
    - computer-readable program code for receiving sets of publicly routable IP addresses, each set of publicly routable IP addresses received from a corresponding Internet service provider (ISP) of a plurality of ISPs;
      
      computer-readable program code for statically assigning non-routable IP addresses of a plurality of non-routable IP addresses in a specified range to corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, wherein said computer-readable program code for statically assigning non-routable IP addresses includes computer-readable code for assigning a first non-routable IP address of said plurality of non-routable IP addresses to said first HTTP proxy computing unit and assigning a second non-routable IP address of said plurality of non-routable IP addresses to said second HTTP proxy computing unit;
      
      computer-readable program code for generating, within each HTTP proxy computing unit of said plurality of HTTP proxy computing units, a corresponding static routing table of a plurality of static routing tables, said computer-readable program code for generating including computer-readable program code for dividing an entire IP address space from IP address 1.0.0.0 to IP address 223.255.255.254 into L segments associated with corresponding routers of said plurality of routers;
      
      computer-readable program code for generating said first routing table as a static routing table corresponding to said first HTTP proxy computing unit; and
      
      computer-readable code for generating said second routing table as a static routing table corresponding to said second HTTP proxy computing unit, wherein L is greater than M, wherein M is a total number of publicly routable IP addresses received from each ISP, and wherein M is also a number of HTTP proxy computing units included in said plurality of HTTP proxy computing units.
  - 12. The program product of claim 11, wherein said computer-readable program code further comprises:
    - computer-readable program code for mapping non-routable IP addresses of said plurality of non-routable IP addresses to publicly routable IP addresses in a first set of said sets of publicly routable IP addresses; and
      
      computer-readable program code for mapping said non-routable IP addresses to publicly routable IP addresses in a second set of said sets of publicly routable IP addresses, wherein said first source IP address is a publicly routable IP address in said first set and is mapped to said first non-routable IP address assigned to said first HTTP proxy computing unit, wherein said second source IP address is a publicly routable IP address in said second set and is mapped to said second non-routable IP address assigned to said second HTTP proxy computing unit, wherein said computer-readable program code for mapping said non-routable IP addresses to said publicly routable IP addresses in said first set includes computer-readable program code for using said first instance of said NAT algorithm, and wherein said computer-readable program code for mapping said non-routable IP addresses to said publicly routable IP addresses in said second set includes computer-readable program code for using said second instance of said NAT algorithm.
  - 13. The program product of claim 11, wherein said computer-readable program code further comprises:
    - computer-readable program code for identifying a first segment of said L segments based on said destination IP address being included in said first segment, wherein said selecting said first router is further based on said first routing table associating said first segment with said first router.
  - 14. The program product of claim 8, wherein said computer-readable program code further comprises:
    - computer-readable program code for receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said first request, a first response to said first request via a communication between said first HTTP proxy computing unit and said web server;
      
      computer-readable program code for sending, by said randomizing HTTP proxy server and in response to said receiving said first response, said first response to said web crawling module;
      
      computer-readable program code for receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said second request, a second response to said second request via a communication between said second HTTP proxy computing unit and said web server; and
      
      computer-readable program code for sending, by said randomizing HTTP proxy server and in response to said receiving said second response, said second response to said web crawling module.
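
Claims 11 and 12 describe statically assigned non-routable addresses that two NAT instances map to publicly routable addresses drawn from two different ISP sets. A minimal sketch of that mapping follows, assuming a one-to-one static NAT table per router. The helper names and all addresses are hypothetical; the public addresses come from ranges reserved for documentation.

```python
import ipaddress


def build_nat_table(private_ips, public_ips):
    """Build a static one-to-one NAT table: private IP -> public IP.

    Assumption: each ISP supplies exactly one public address per proxy unit.
    """
    if len(private_ips) != len(public_ips):
        raise ValueError("need one public address per private address")
    for ip in private_ips:
        if not ipaddress.ip_address(ip).is_private:
            raise ValueError(f"{ip} is not a non-routable address")
    return dict(zip(private_ips, public_ips))


def translate(nat_table, private_ip):
    """Return the public source IP that this NAT instance presents."""
    return nat_table[private_ip]


# Non-routable addresses statically assigned to two proxy units.
unit_ips = ["10.0.0.1", "10.0.0.2"]

# One NAT instance per router, each using a different ISP's address set.
nat_router_1 = build_nat_table(unit_ips, ["192.0.2.1", "192.0.2.2"])
nat_router_2 = build_nat_table(unit_ips, ["198.51.100.1", "198.51.100.2"])

# First request: proxy unit 1 via router 1. Second: proxy unit 2 via router 2.
first_source = translate(nat_router_1, "10.0.0.1")
second_source = translate(nat_router_2, "10.0.0.2")
print(first_source, second_source)
```

In this sketch the two source addresses differ both because the proxy units differ and because the routers draw on different ISP sets.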

15. A process for supporting computing infrastructure, said process comprising providing at least one support service for at least one of creating, integrating, hosting, maintaining, and deploying computer-readable code in a computing system, wherein the code in combination with the computing system is capable of performing a method of preventing a detection of web crawling in a computing environment, comprising:
- receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module and included in said computing system, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
  
  forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
  
  said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
  
  said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
  
  randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
  
  receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website; and
  
  forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
  
  said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
  
  said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
  
  a central processing unit (CPU) of said computing system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
- Dependent claims (16, 17, 18, 19, 20, 21)
  - 16. The process of claim 15, wherein said method further comprises:
    - establishing a first Transfer Control Protocol (TCP) connection between said web crawling module and said randomizing HTTP proxy server;
      
      randomly selecting, by said randomizing HTTP proxy server and in response to said establishing said first TCP connection, said first HTTP proxy computing unit; and
      
      establishing, subsequent to said randomly selecting said first HTTP proxy computing unit, a second TCP connection between said randomizing HTTP proxy server and said first HTTP proxy computing unit.
  - 17. The process of claim 15, wherein said method further comprises establishing, subsequent to said randomly selecting said second HTTP proxy computing unit, a TCP connection between said randomizing HTTP proxy server and said second HTTP proxy computing unit.
  - 18. The process of claim 15, wherein said method further comprises:
    - receiving sets of publicly routable IP addresses, each set of publicly routable IP addresses received from a corresponding Internet service provider (ISP) of a plurality of ISPs;
      
      statically assigning non-routable IP addresses of a plurality of non-routable IP addresses in a specified range to corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, wherein said statically assigning non-routable IP addresses includes assigning a first non-routable IP address of said plurality of non-routable IP addresses to said first HTTP proxy computing unit and assigning a second non-routable IP address of said plurality of non-routable IP addresses to said second HTTP proxy computing unit;
      
      generating, within each HTTP proxy computing unit of said plurality of HTTP proxy computing units, a corresponding static routing table of a plurality of static routing tables, said generating including dividing an entire IP address space from IP address 1.0.0.0 to IP address 223.255.255.254 into L segments associated with corresponding routers of said plurality of routers;
      
      generating said first routing table as a static routing table corresponding to said first HTTP proxy computing unit; and
      
      generating said second routing table as a static routing table corresponding to said second HTTP proxy computing unit, wherein L is greater than M, wherein M is a total number of publicly routable IP addresses received from each ISP, and wherein M is also a number of HTTP proxy computing units included in said plurality of HTTP proxy computing units.
  - 19. The process of claim 18, wherein said method further comprises:
    - mapping non-routable IP addresses of said plurality of non-routable IP addresses to publicly routable IP addresses in a first set of said sets of publicly routable IP addresses; and
      
      mapping said non-routable IP addresses to publicly routable IP addresses in a second set of said sets of publicly routable IP addresses, wherein said first source IP address is a publicly routable IP address in said first set and is mapped to said first non-routable IP address assigned to said first HTTP proxy computing unit, wherein said second source IP address is a publicly routable IP address in said second set and is mapped to said second non-routable IP address assigned to said second HTTP proxy computing unit, wherein said mapping said non-routable IP addresses to said publicly routable IP addresses in said first set includes using said first instance of said NAT algorithm, and wherein said mapping said non-routable IP addresses to said publicly routable IP addresses in said second set includes using said second instance of said NAT algorithm.
  - 20. The process of claim 18, wherein said method further comprises:
    - identifying a first segment of said L segments based on said destination IP address being included in said first segment, wherein said selecting said first router is further based on said first routing table associating said first segment with said first router.
  - 21. The process of claim 15, wherein said method further comprises:
    - receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said first request, a first response to said first request via a communication between said first HTTP proxy computing unit and said web server;
      
      sending, by said randomizing HTTP proxy server and in response to said receiving said first response, said first response to said web crawling module;
      
      receiving, at said randomizing HTTP proxy server and subsequent to said forwarding said second request, a second response to said second request via a communication between said second HTTP proxy computing unit and said web server; and
      
      sending, by said randomizing HTTP proxy server and in response to said receiving said second response, said second response to said web crawling module.
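
The segmentation and lookup recited in claims 18 and 20 can be illustrated with a minimal Python sketch. It is not taken from the patent: the equal-sized segments, the function names (`make_segments`, `build_static_table`, `select_router`), the router names, and the choice of L = 8 are all assumptions made for illustration. The claims only require that the space from 1.0.0.0 to 223.255.255.254 be divided into L segments, each associated with a router.

```python
import bisect
import ipaddress
import random

# Bounds of the address space named in claim 18.
SPACE_START = int(ipaddress.IPv4Address("1.0.0.0"))
SPACE_END = int(ipaddress.IPv4Address("223.255.255.254"))


def make_segments(num_segments):
    """Split the address space into contiguous (low, high) ranges.

    Equal-sized segments are an assumption; the claims do not fix the sizes.
    """
    total = SPACE_END - SPACE_START + 1
    bounds = [SPACE_START + (i * total) // num_segments
              for i in range(num_segments + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_segments)]


def build_static_table(segments, routers, rng):
    """Pair every segment with one router, chosen at random here."""
    return [(low, high, rng.choice(routers)) for low, high in segments]


def select_router(table, destination):
    """Return the router whose segment contains the destination address."""
    addr = int(ipaddress.IPv4Address(destination))
    if not SPACE_START <= addr <= SPACE_END:
        raise ValueError(f"{destination} is outside the segmented address space")
    lows = [low for low, _, _ in table]
    # Index of the last segment that starts at or before the address.
    index = bisect.bisect_right(lows, addr) - 1
    return table[index][2]


if __name__ == "__main__":
    routers = ["router-1", "router-2"]   # hypothetical router names
    segments = make_segments(8)          # L = 8, chosen arbitrarily
    table = build_static_table(segments, routers, random.Random(0))
    print(select_router(table, "192.0.2.10"))
```

Because each HTTP proxy computing unit would hold its own table, calling `build_static_table` once per unit with the same segments gives every unit a potentially different router for the same destination.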

22. A computer-implemented method of preventing a detection of web crawling, comprising:
- a computer system receiving a plurality of static routing tables corresponding to a plurality of Hypertext Transfer Protocol (HTTP) proxy computing units included in said computer system, and for each static routing table of said plurality of static routing tables, randomly assigning routers included in said computer system to segments of an entire Internet Protocol (IP) address space, wherein said segments are included in each static routing table, wherein said randomly assigning routers includes assigning same segments of said entire IP address space to different routers of said randomly assigned routers in different static routing tables of said plurality of static routing tables, wherein said different static routing tables includes a first static routing table that assigns a first segment of said entire IP address space to a first router of said routers and further includes a second static routing table that assigns a second segment of said entire IP address space to a second router of said routers, wherein said routers are coupled to said plurality of HTTP proxy computing units via a first network, and wherein said first and second static routing tables of said plurality of static routing tables are included in first and second HTTP proxy computing units of said plurality of HTTP proxy computing units, respectively;
  
  said computer system including a randomizing HTTP proxy server including a CPU randomly selecting said first HTTP proxy computing unit of said plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a second network, wherein said randomizing HTTP proxy server is coupled to a web crawling module included in said computer system;
  
  said randomizing HTTP proxy server receiving a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source IP addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
  
  forwarding, by said randomizing HTTP proxy server, said first request to said first HTTP proxy computing unit;
  
  said first HTTP proxy computing unit selecting a first router of said routers based on said first static routing table included in said first HTTP proxy computing unit having said first segment of said entire IP address space assigned to said first router and further based on a destination IP address of said target website being included in said first segment of said entire IP address space;
  
  said first router obtaining a first source IP address of a first plurality of source IP addresses based on a first network address translation (NAT) table included in said first router, wherein said first NAT table includes first associations of said first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and wherein said first associations include an association of said first source IP address with said first HTTP proxy computing unit;
  
  said first router sending said first request to said web server so that said first request is presented to said web server as originating from said first source IP address;
  
  randomly selecting, by said randomizing HTTP proxy server, said second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
  
  receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
  
  forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
  
  said second HTTP proxy computing unit selecting a second router of said routers based on said second static routing table included in said second HTTP proxy computing unit having said first segment of said entire IP address space assigned to said second router and further based on said destination IP address of said target website being included in said first segment of said entire IP address space;
  
  said second router obtaining a second source IP address of a second plurality of source IP addresses based on a second NAT table included in said second router, wherein said second NAT table includes second associations of said second plurality of source IP addresses with said corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and wherein said second associations include an association of said second source IP address with said second HTTP proxy computing unit;
  
  said second router sending said second request to said web server so that said second request is presented to said web server as originating from said second source IP address; and
  
  said computer system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address, wherein said first source IP address is different from said second source IP address based on said forwarding said first request to said randomly selected first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router based on said first static routing table that assigns said first segment of said entire IP address space to said first router, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router based on said second static routing table that assigns said second segment of said entire IP address space to said second router.
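
The request flow of claim 22 can be sketched end to end as a small Python simulation. Everything concrete in it is hypothetical: the three proxy units with private addresses, the two routers, the public source addresses (drawn from documentation-only ranges), the destination address, and the helper names. It assumes equal-sized segments and that every (router, proxy unit) pair has a unique public address, so that choosing a different proxy unit for each consecutive request yields a different source IP address.

```python
import ipaddress
import random

SPACE_START = int(ipaddress.IPv4Address("1.0.0.0"))
SPACE_END = int(ipaddress.IPv4Address("223.255.255.254"))
NUM_SEGMENTS = 8  # "L" in the claims; assumed value, larger than the proxy count

# Hypothetical NAT tables: router -> {proxy unit's private IP -> public source IP}.
NAT_TABLES = {
    "router-A": {"10.0.0.1": "198.51.100.1", "10.0.0.2": "198.51.100.2",
                 "10.0.0.3": "198.51.100.3"},
    "router-B": {"10.0.0.1": "203.0.113.1", "10.0.0.2": "203.0.113.2",
                 "10.0.0.3": "203.0.113.3"},
}
PROXIES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ROUTERS = list(NAT_TABLES)


def segment_index(destination):
    """Map a destination address to the index of its (equal-sized) segment."""
    addr = int(ipaddress.IPv4Address(destination))
    if not SPACE_START <= addr <= SPACE_END:
        raise ValueError(f"{destination} is outside the segmented address space")
    total = SPACE_END - SPACE_START + 1
    return (addr - SPACE_START) * NUM_SEGMENTS // total


def build_routing_tables(rng):
    """Give each proxy unit its own static table: one random router per segment."""
    return {proxy: [rng.choice(ROUTERS) for _ in range(NUM_SEGMENTS)]
            for proxy in PROXIES}


def source_ip(destination, proxy, routing_tables):
    """Proxy unit picks a router from its table; the router's NAT picks the source IP."""
    router = routing_tables[proxy][segment_index(destination)]
    return NAT_TABLES[router][proxy]


def crawl(destination, count, rng):
    """Simulate the randomizing proxy server forwarding `count` requests."""
    routing_tables = build_routing_tables(rng)
    sources, previous = [], None
    for _ in range(count):
        # Randomly select a proxy unit different from the previous one.
        proxy = rng.choice([p for p in PROXIES if p != previous])
        sources.append(source_ip(destination, proxy, routing_tables))
        previous = proxy
    return sources


if __name__ == "__main__":
    # Source IP addresses the target web server would observe.
    print(crawl("192.0.2.10", 5, random.Random(7)))
```

In this sketch no two consecutive requests share a source address, which is the property the claim relies on to keep a per-source request count below the web server's threshold.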

Current Assignee
Rakuten Group, Inc.
Original Assignee
International Business Machines Corporation
Inventors
Andreev, Dmitry; Grunin, Galina; Vilshansky, Gregory
Primary Examiner(s)
Patel, Ashok B
Assistant Examiner(s)
Potratz, Daniel B

Application Number
US11/669,322
Publication Number
US 20080183889A1
Time in Patent Office
1,581 Days
Field of Search
709/245, 709/238, 709/220, 709/223, 709/228, 709/235, 726/12, 726/22, 726/23, 707/709, 707/710
US Class Current
709/228
CPC Class Codes
H04L 61/2503   Translation of Internet pro...
H04L 63/0407   wherein the identity of one...
H04L 63/30   for supporting lawful inter...
H04L 67/02   based on web technology, e....
H04L 67/563   Data redirection of data ne...
H04L 9/40   Network security protocols

