Method and system for preventing web crawling detection
First Claim
1. A computer-implemented method of preventing a detection of web crawling, comprising:
- receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module and included in a computer system, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
a central processing unit (CPU) of said computer system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
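The detection scheme that claim 1 attributes to the web server (counting requests that share a source IP address and comparing the count against a predefined threshold) can be illustrated with a short sketch. This is a minimal illustration only, not code from the patent: the function name `detect_crawling`, the threshold value, and the documentation-range IP addresses are assumptions chosen for the example.

```python
from collections import Counter


def detect_crawling(source_ips, threshold):
    """Return the source IPs whose request count exceeds the threshold.

    Hypothetical model of the server-side check described in the claim:
    identical source IPs are grouped and each group is counted.
    """
    counts = Counter(source_ips)
    return {ip for ip, n in counts.items() if n > threshold}


# Five requests from one address versus five requests from five addresses.
same_source = ["198.51.100.7"] * 5
spread = ["198.51.100.7", "203.0.113.9", "192.0.2.44",
          "198.51.100.8", "203.0.113.10"]

flagged_same = detect_crawling(same_source, threshold=3)
flagged_spread = detect_crawling(spread, threshold=3)
```

Under these assumptions the single-address traffic is flagged and the spread traffic is not, which is the effect the claimed method relies on.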
Abstract
A method and system for preventing a detection of web crawling. A randomizing HTTP proxy server receives a first request from a web crawler to scan a website and forwards the first request to a randomly selected first proxy computer. The first proxy computer utilizes a first network address translation (NAT)-enabled router to forward the first request to the website. A NAT algorithm associates a first source Internet Protocol (IP) address with the first request. The randomizing HTTP proxy server receives a second web crawler-initiated request to scan the website and forwards the second request to a randomly selected second proxy computer. The second proxy computer utilizes a second NAT-enabled router to forward the second request to the website. The NAT algorithm associates a second source IP address with the second request. The web server identifies the first and second source IP addresses as being different.
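The flow in the abstract can be modeled as a small in-memory simulation: a randomizing proxy picks a proxy unit, the proxy unit looks up a router in its routing table, and the router's NAT table supplies the source IP address. This is a sketch under stated assumptions rather than the patented implementation: the names `ROUTING_TABLES`, `NAT_TABLES`, `pick_proxy`, `select_router`, and `source_ip_for`, the two-proxy and two-router topology, and all addresses are hypothetical, and no network traffic is sent.

```python
import ipaddress
import random

# Hypothetical routing tables: proxy unit -> list of (destination network, router).
ROUTING_TABLES = {
    "proxy_1": [(ipaddress.ip_network("0.0.0.0/0"), "router_a")],
    "proxy_2": [(ipaddress.ip_network("0.0.0.0/0"), "router_b")],
}

# Hypothetical NAT tables: router -> {proxy unit -> source IP presented to the server}.
NAT_TABLES = {
    "router_a": {"proxy_1": "198.51.100.1", "proxy_2": "198.51.100.2"},
    "router_b": {"proxy_1": "203.0.113.1", "proxy_2": "203.0.113.2"},
}


def pick_proxy(rng, exclude=None):
    """Randomly select a proxy unit, optionally excluding the previous one."""
    candidates = [p for p in ROUTING_TABLES if p != exclude]
    return rng.choice(candidates)


def select_router(proxy, dest_ip):
    """Return the router that the proxy's routing table assigns to dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    for network, router in ROUTING_TABLES[proxy]:
        if addr in network:
            return router
    raise LookupError(f"no route from {proxy} to {dest_ip}")


def source_ip_for(proxy, dest_ip):
    """Return the source IP the server would see for a request via this proxy."""
    router = select_router(proxy, dest_ip)
    return NAT_TABLES[router][proxy]


rng = random.Random(0)
target = "192.0.2.80"  # placeholder destination address

first_proxy = pick_proxy(rng)
second_proxy = pick_proxy(rng, exclude=first_proxy)
first_src = source_ip_for(first_proxy, target)
second_src = source_ip_for(second_proxy, target)
```

In this toy topology the two requests always reach the server with different source addresses, because the second proxy unit is forced to differ from the first and each proxy unit maps to a different router.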
22 Claims
1. A computer-implemented method of preventing a detection of web crawling, comprising:
receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module and included in a computer system, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
a central processing unit (CPU) of said computer system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer program product for preventing a detection of web crawling in a computing environment, said computer program product comprising a computer-readable, tangible storage device having a computer-readable program code stored therein, said computer-readable program code containing instructions that are carried out by a processor of a computer system, said computer-readable program code comprising:
computer-readable program code for receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and by determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
computer-readable program code for forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
computer-readable program code for randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
computer-readable program code for receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
computer-readable program code for forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
computer-readable program code for preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A process for supporting computing infrastructure, said process comprising providing at least one support service for at least one of creating, integrating, hosting, maintaining, and deploying computer-readable code in a computing system, wherein the code in combination with the computing system is capable of performing a method of preventing a detection of web crawling in a computing environment, comprising:
receiving, by a randomizing HTTP proxy server including a CPU coupled to a web crawling module and included in said computing system, a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source Internet Protocol (IP) addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said first request to a first HTTP proxy computing unit of a plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a network;
said first HTTP proxy computing unit selecting a first router from a plurality of routers based on a first routing table associating a destination IP address of said target website with said first router;
said first router sending said first request to said web server by utilizing a first instance of a network address translation (NAT) algorithm that associates a first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and that further associates a first source IP address of said first plurality of source IP addresses with said first HTTP proxy computing unit;
randomly selecting, by said randomizing HTTP proxy server, a second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website; and
forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
said second HTTP proxy computing unit selecting a second router from said plurality of routers based on a second routing table associating said destination IP address of said target website with said second router;
said second router sending said second request to said web server by utilizing a second instance of said NAT algorithm that associates a second plurality of source IP addresses with said corresponding HTTP proxy computing units, and that further associates a second source IP address of said second plurality of source IP addresses with said second HTTP proxy computing unit, wherein said second source IP address is different from said first source IP address based on said forwarding said first request to said first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router from said plurality of routers based on said first routing table associating said destination IP address with said first router and said first router sending said first request to said web server, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router from said plurality of routers based on said second routing table associating said destination IP address with said second router and said second router sending said second request to said web server; and
a central processing unit (CPU) of said computing system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address.
Dependent claims: 16, 17, 18, 19, 20, 21
22. A computer-implemented method of preventing a detection of web crawling, comprising:
a computer system receiving a plurality of static routing tables corresponding to a plurality of Hypertext Transfer Protocol (HTTP) proxy computing units included in said computer system, and for each static routing table of said plurality of static routing tables, randomly assigning routers included in said computer system to segments of an entire Internet Protocol (IP) address space, wherein said segments are included in each static routing table, wherein said randomly assigning routers includes assigning same segments of said entire IP address space to different routers of said randomly assigned routers in different static routing tables of said plurality of static routing tables, wherein said different static routing tables includes a first static routing table that assigns a first segment of said entire IP address space to a first router of said routers and further includes a second static routing table that assigns a second segment of said entire IP address space to a second router of said routers, wherein said routers are coupled to said plurality of HTTP proxy computing units via a first network, and wherein said first and second static routing tables of said plurality of static routing tables are included in first and second HTTP proxy computing units of said plurality of HTTP proxy computing units, respectively;
said computer system including a randomizing HTTP proxy server including a CPU randomly selecting said first HTTP proxy computing unit of said plurality of HTTP proxy computing units coupled to said randomizing HTTP proxy server via a second network, wherein said randomizing HTTP proxy server is coupled to a web crawling module included in said computer system;
said randomizing HTTP proxy server receiving a first request from said web crawling module to scan a target website provided by a web server that attempts to detect web crawling by identifying identical source IP addresses of multiple requests to scan said target website and determining the number of said multiple requests to scan said target website exceeds a predefined threshold level, wherein said multiple requests include said first request and a second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said first request to said first HTTP proxy computing unit;
said first HTTP proxy computing unit selecting a first router of said routers based on said first static routing table included in said first HTTP proxy computing unit having said first segment of said entire IP address space assigned to said first router and further based on a destination IP address of said target website being included in said first segment of said entire IP address space;
said first router obtaining a first source IP address of a first plurality of source IP addresses based on a first network address translation (NAT) table included in said first router, wherein said first NAT table includes first associations of said first plurality of source IP addresses with corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and wherein said first associations include an association of said first source IP address with said first HTTP proxy computing unit;
said first router sending said first request to said web server so that said first request is presented to said web server as originating from said first source IP address;
randomly selecting, by said randomizing HTTP proxy server, said second HTTP proxy computing unit of said plurality of HTTP proxy computing units, said second HTTP proxy computing unit being different from said first HTTP proxy computing unit;
receiving, by said randomizing HTTP proxy server, said second request from said web crawling module to scan said target website;
forwarding, by said randomizing HTTP proxy server, said second request to said second HTTP proxy computing unit;
said second HTTP proxy computing unit selecting a second router of said routers based on said second static routing table included in said second HTTP proxy computing unit having said first segment of said entire IP address space assigned to said second router and further based on said destination IP address of said target website being included in said first segment of said entire IP address space;
said second router obtaining a second source IP address of a second plurality of source IP addresses based on a second NAT table included in said second router, wherein said second NAT table includes second associations of said second plurality of source IP addresses with said corresponding HTTP proxy computing units of said plurality of HTTP proxy computing units, and wherein said second associations include an association of said second source IP address with said second HTTP proxy computing unit;
said second router sending said second request to said web server so that said second request is presented to said web server as originating from said second source IP address; and
said computer system preventing said web server from detecting said web crawling by presenting said first request and said second request to said web server as originating from different sources based on said first source IP address being different from said second source IP address, wherein said first source IP address is different from said second source IP address based on said forwarding said first request to said randomly selected first HTTP proxy computing unit, further based on said first HTTP proxy computing unit selecting said first router based on said first static routing table that assigns said first segment of said entire IP address space to said first router, still further based on said forwarding said second request to said randomly selected second HTTP proxy computing unit, and further yet based on said second HTTP proxy computing unit selecting said second router based on said second static routing table that assigns said second segment of said entire IP address space to said second router.
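Claim 22 adds per-proxy static routing tables in which routers are randomly assigned to segments of the entire IP address space. One way to picture that step is the sketch below, which splits the IPv4 space into equal segments and draws a router for each segment independently per proxy unit. The function names `build_static_tables` and `lookup`, the choice of four /2 segments, and the proxy and router labels are all assumptions for illustration; the patent does not specify segment sizes or a selection routine.

```python
import ipaddress
import random


def build_static_tables(proxies, routers, rng, prefix_len=2):
    """Build one static routing table per proxy unit.

    Each table maps every segment of the IPv4 address space to a randomly
    chosen router. Independent draws per proxy mean the same segment can map
    to different routers in different tables, although plain random choice
    does not guarantee that for any particular segment.
    """
    whole_space = ipaddress.ip_network("0.0.0.0/0")
    segments = list(whole_space.subnets(new_prefix=prefix_len))
    return {
        proxy: {segment: rng.choice(routers) for segment in segments}
        for proxy in proxies
    }


def lookup(table, dest_ip):
    """Return the router assigned to the segment that contains dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    for segment, router in table.items():
        if addr in segment:
            return router
    raise LookupError(f"no segment contains {dest_ip}")


routers = ["router_a", "router_b", "router_c"]
tables = build_static_tables(["proxy_1", "proxy_2"], routers, random.Random(1))
router_for_first = lookup(tables["proxy_1"], "192.0.2.80")
router_for_second = lookup(tables["proxy_2"], "192.0.2.80")
```

A deployment that must guarantee different routers for the same segment across tables would need an extra constraint on the draw, for example rejecting a table whose assignment for a segment matches another table's.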
Specification