Systems and methods of handling internet spiders
First Claim
1. A method for implementation on one or more computers, comprising:
accepting, at one or more of the one or more computers, requests for resources, each resource identified by a respective URL resolvable to a common domain;
collecting, at one or more of the one or more computers, identifying information for entities making the requests;
comparing, at one or more of the one or more computers, the identifying information for the entities making the requests; and
categorizing, at one or more of the one or more computers, one or more of the entities as a web crawler, based on determining whether at least a threshold number of requests for the identical URL were made from a single entity within a time limit, and responsive to the determining returning an affirmative result, categorizing the entity that made those requests as a web crawler.
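The categorizing step above can be sketched as a sliding-window counter: an entity that issues at least a threshold number of requests for the identical URL within a time limit is categorized as a web crawler. This is a minimal illustrative sketch, not the patented implementation; the names `SpiderDetector`, `record_request`, and the threshold and time-limit values are assumptions introduced for illustration.

```python
from collections import defaultdict, deque
import time

# Illustrative sketch of the claimed method. The class and parameter
# names are hypothetical, not taken from the patent.
class SpiderDetector:
    def __init__(self, threshold=5, time_limit=60.0):
        self.threshold = threshold          # minimum requests for the identical URL
        self.time_limit = time_limit        # time limit in seconds
        # identifying information collected per (entity, URL) pair:
        # a deque of request timestamps
        self.requests = defaultdict(deque)
        self.crawlers = set()

    def record_request(self, entity_id, url, now=None):
        """Accept a request, collect identifying info, compare against
        prior requests, and categorize the entity if the threshold is met."""
        now = time.monotonic() if now is None else now
        window = self.requests[(entity_id, url)]
        window.append(now)
        # drop requests that fall outside the time limit
        while window and now - window[0] > self.time_limit:
            window.popleft()
        if len(window) >= self.threshold:
            self.crawlers.add(entity_id)    # categorize as a web crawler
        return entity_id in self.crawlers
```

For example, with a threshold of 3 and a 10-second window, three requests for the same URL from one entity within that window would mark the entity as a crawler, while the same requests spread over minutes would not.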
Abstract
Aspects relate to identifying Internet spiders using a plurality of instances of one or more URLs that reference resources available from a first domain. Instances of the URLs are distributed at other Internet domains. Spiders crawling those domains will activate those URL instances, producing requests for the resources referenced by the URLs. An entity that generates a number of requests for the same resource, from a potential multitude of URL instances, can be categorized as a spider. Similarly, generating a number of requests for resources identified by different URLs also can be treated as spider behavior. In some cases, the first domain may not have a browseable site infrastructure, such that a spider would not readily crawl it by following internal links. The URLs can refer to custom queries created by various users, who can provide the URLs on their pages, such as on social networking sites.
Specification