Identifying internet protocol addresses for internet hosting entities
First Claim
1. A system comprising one or more computers programmed to perform operations comprising:
- maintaining an Internet Protocol (IP) address history for each hostname in a plurality of hostnames, where each IP address history is a time series of IP addresses;
- organizing the hostnames into a collection of groups so that each hostname of the plurality of hostnames is a member of exactly one group in the collection of groups, where each group has a kernel calculated from the IP address histories of the members of the group, and where the IP address history of each member of the group is within a threshold distance of the kernel of the group;
- providing to a crawler, for use in scheduling a crawl of the plurality of hostnames, data describing the collection of groups;
- receiving an update to an IP address history for a first hostname of the plurality of hostnames, the first hostname being a member of a first group of the collection of groups, and recalculating a first kernel of the first group using the updated IP address history of the first hostname;
- receiving an update to an IP address history for a second hostname, the second hostname being a member of a second group of the collection of groups, and recalculating a second kernel of the second group using the updated IP address history of the second hostname; and
- determining that the first kernel is within the threshold distance of the second kernel and, as a result, merging the first group and the second group into a single group in the collection of groups.
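The update and merge steps of the claim can be illustrated with a minimal sketch. The claim text here does not say how a kernel or a distance is computed, so everything concrete below is an assumption made for illustration: the kernel is taken to be the set of IP addresses seen in at least half of a group's member histories, the distance is Jaccard distance between IP sets, and the names `Group`, `jaccard_distance`, and `merge_if_close` are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Assumed distance: 1 - |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


class Group:
    """A group of hostnames with a kernel derived from their IP histories."""

    def __init__(self, histories):
        # histories: hostname -> list of IP strings, oldest first (copied).
        self.histories = {host: list(ips) for host, ips in histories.items()}
        self.recalculate_kernel()

    def recalculate_kernel(self):
        # Assumed kernel: IPs present in at least half of the member histories.
        counts = Counter(ip for h in self.histories.values() for ip in set(h))
        need = len(self.histories) / 2
        self.kernel = {ip for ip, n in counts.items() if n >= need}

    def update(self, hostname, new_ip):
        # Append the newly observed IP, then recalculate the group's kernel.
        self.histories[hostname].append(new_ip)
        self.recalculate_kernel()


def merge_if_close(groups, first, second, threshold):
    """Merge `second` into `first` if their kernels are within `threshold`."""
    if jaccard_distance(first.kernel, second.kernel) > threshold:
        return False
    first.histories.update(second.histories)
    first.recalculate_kernel()
    groups.remove(second)
    return True


g1 = Group({"a.example": ["192.0.2.1"]})
g2 = Group({"b.example": ["198.51.100.7"]})
groups = [g1, g2]

before = merge_if_close(groups, g1, g2, threshold=0.5)  # kernels are disjoint

# Both hostnames move toward the same addresses; each kernel is recalculated.
g1.update("a.example", "203.0.113.5")
g2.update("b.example", "192.0.2.1")
g2.update("b.example", "203.0.113.5")

after = merge_if_close(groups, g1, g2, threshold=0.5)
```

This sketch only compares the two kernels, as the final step of the claim does. It does not re-check that every member of the merged group is still within the threshold of the new kernel, which a fuller implementation would presumably need to do.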
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying web hosting entities. In one aspect, a system includes one or more computers programmed to perform operations including maintaining an Internet Protocol (IP) address history for each hostname in a plurality of hostnames. Each IP address history is a time series of IP addresses. The operations further include organizing the hostnames into a collection of groups so that each hostname of the plurality of hostnames is a member of exactly one group in the collection of groups. Each group has a kernel calculated from the IP address histories of the members of the group, and the IP address history of each member of the group is within a threshold distance of the kernel of the group.
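As a rough illustration of the grouping described in the abstract, the sketch below stores each IP address history as a time series of `(timestamp, ip)` pairs and places every hostname in exactly one group. The abstract does not define the kernel or the distance, so the choices here are assumptions: the kernel is the set of IPs appearing in at least half of the member histories, the distance is Jaccard distance, and the function names (`ip_set`, `distance`, `kernel`, `organize`) are hypothetical. The greedy single pass is one possible strategy, not necessarily the patented one.

```python
from collections import Counter


def ip_set(history):
    """Collapse a time series of (timestamp, ip) pairs to the set of IPs seen."""
    return {ip for _, ip in history}


def distance(a, b):
    """Assumed distance between two IP sets: Jaccard distance."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def kernel(histories):
    """Assumed kernel: IPs present in at least half of the given histories."""
    counts = Counter(ip for h in histories for ip in ip_set(h))
    need = len(histories) / 2
    return {ip for ip, n in counts.items() if n >= need}


def organize(histories, threshold):
    """Greedily assign each hostname to exactly one group.

    A hostname joins an existing group only if, after joining, every member
    (including the newcomer) is within `threshold` of the recalculated kernel.
    Otherwise it starts a new group of its own.
    """
    groups = []
    for host in histories:
        for group in groups:
            candidate = group | {host}
            k = kernel([histories[m] for m in candidate])
            if all(distance(ip_set(histories[m]), k) <= threshold
                   for m in candidate):
                group.add(host)
                break
        else:
            groups.append({host})
    return groups


histories = {
    "a.example": [(1, "192.0.2.1"), (2, "192.0.2.2")],
    "b.example": [(1, "192.0.2.1"), (2, "192.0.2.2")],
    "c.example": [(1, "198.51.100.7")],
}
groups = organize(histories, threshold=0.5)
```

With these inputs, `a.example` and `b.example` share a group because their histories match, while `c.example` has no IP in common with that group's kernel and forms its own group.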
22 Claims
1. A system comprising one or more computers programmed to perform operations comprising:
- maintaining an Internet Protocol (IP) address history for each hostname in a plurality of hostnames, where each IP address history is a time series of IP addresses;
- organizing the hostnames into a collection of groups so that each hostname of the plurality of hostnames is a member of exactly one group in the collection of groups, where each group has a kernel calculated from the IP address histories of the members of the group, and where the IP address history of each member of the group is within a threshold distance of the kernel of the group;
- providing to a crawler, for use in scheduling a crawl of the plurality of hostnames, data describing the collection of groups;
- receiving an update to an IP address history for a first hostname of the plurality of hostnames, the first hostname being a member of a first group of the collection of groups, and recalculating a first kernel of the first group using the updated IP address history of the first hostname;
- receiving an update to an IP address history for a second hostname, the second hostname being a member of a second group of the collection of groups, and recalculating a second kernel of the second group using the updated IP address history of the second hostname; and
- determining that the first kernel is within the threshold distance of the second kernel and, as a result, merging the first group and the second group into a single group in the collection of groups.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer-implemented method comprising:
- maintaining an Internet Protocol (IP) address history for each hostname in a plurality of hostnames, where each IP address history is a time series of IP addresses;
- organizing the hostnames into a collection of groups so that each hostname of the plurality of hostnames is a member of exactly one group in the collection of groups, where each group has a kernel calculated from the IP address histories of the members of the group, and where the IP address history of each member of the group is within a threshold distance of the kernel of the group;
- providing to a crawler, for use in scheduling a crawl of the plurality of hostnames, data describing the collection of groups;
- receiving an update to an IP address history for a first hostname of the plurality of hostnames, the first hostname being a member of a first group of the collection of groups, and recalculating a first kernel of the first group using the updated IP address history of the first hostname;
- receiving an update to an IP address history for a second hostname, the second hostname being a member of a second group of the collection of groups, and recalculating a second kernel of the second group using the updated IP address history of the second hostname; and
- determining that the first kernel is within the threshold distance of the second kernel and, as a result, merging the first group and the second group into a single group in the collection of groups.
Dependent claims: 10, 11, 12, 13, 14
15. A non-transitory computer storage medium having instructions stored thereon that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
- maintaining an Internet Protocol (IP) address history for each hostname in a plurality of hostnames, where each IP address history is a time series of IP addresses;
- organizing the hostnames into a collection of groups so that each hostname of the plurality of hostnames is a member of exactly one group in the collection of groups, where each group has a kernel calculated from the IP address histories of the members of the group, and where the IP address history of each member of the group is within a threshold distance of the kernel of the group;
- providing to a crawler, for use in scheduling a crawl of the plurality of hostnames, data describing the collection of groups;
- receiving an update to an IP address history for a first hostname of the plurality of hostnames, the first hostname being a member of a first group of the collection of groups, and recalculating a first kernel of the first group using the updated IP address history of the first hostname;
- receiving an update to an IP address history for a second hostname, the second hostname being a member of a second group of the collection of groups, and recalculating a second kernel of the second group using the updated IP address history of the second hostname; and
- determining that the first kernel is within the threshold distance of the second kernel and, as a result, merging the first group and the second group into a single group in the collection of groups.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
Specification