Method and apparatus for finding mirrored hosts by analyzing connectivity and IP addresses
Abstract
A method and system that detects mirrored host pairs using information about a large set of pages, including one or more of: URLs, IP addresses, and connectivity information. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention use one or a combination of techniques to identify mirrors. A first group of techniques involves determining mirrors based on URLs and information about connectivity (i.e., hyperlinks) between pages. A second group of techniques looks at connectivity information at a higher granularity, considering all links from all pages on a host as one group and ignoring the target of each link beyond the host level.
31 Claims
-
1. A method of determining mirrored web sites, comprising:
-
receiving information about a plurality of web sites stored on a plurality of hosts;
determining a list of host pairs that are potentially mirrored hosts, wherein mirrored hosts contain highly similar documents within the same path; and
analyzing the list of pairs of potential mirrored hosts to determine which of the host pairs are mirrored hosts.
Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 21, 22, 23)
receiving information about the IP addresses of the plurality of web sites stored on the plurality of hosts;
determining clusters of hosts, where all web sites in a cluster have the same IP addresses; and
determining that the hosts in clusters of hosts having less than or equal to a threshold number of hosts therein are mirrored web hosts.
-
-
10. The method of claim 9, wherein at least the first three octets of the IP address of all hosts in a cluster are identical.
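The IP-address technique recited above can be illustrated with a short sketch. It is not the patented implementation: the function name `ip_mirror_candidates`, the parameters `max_cluster_size` and `prefix_octets`, and the choice to ignore single-host clusters are assumptions made for illustration. Hosts are grouped by full IPv4 address or, in the variant where the first three octets are identical, by address prefix, and only clusters at or below the size threshold are kept.

```python
from collections import defaultdict


def ip_mirror_candidates(host_ips, max_cluster_size=2, prefix_octets=4):
    """Group hosts by IPv4 address and keep the small clusters.

    host_ips: dict mapping host name -> dotted-quad IPv4 string.
    max_cluster_size: hypothetical threshold; clusters larger than this
        are treated as virtual hosting rather than mirroring.
    prefix_octets: 4 compares whole addresses, 3 compares only the
        first three octets.
    """
    clusters = defaultdict(list)
    for host, ip in host_ips.items():
        key = ".".join(ip.split(".")[:prefix_octets])
        clusters[key].append(host)
    # Assumption: a cluster of one host has nothing to mirror, so skip it.
    return {
        key: sorted(hosts)
        for key, hosts in clusters.items()
        if 2 <= len(hosts) <= max_cluster_size
    }


host_ips = {
    "a.example": "192.0.2.10",
    "b.example": "192.0.2.10",
    "c.example": "192.0.2.77",
    "d.example": "198.51.100.5",
}
exact = ip_mirror_candidates(host_ips)
by_prefix = ip_mirror_candidates(host_ips, max_cluster_size=3, prefix_octets=3)
```

With the sample data, `exact` contains one cluster for `192.0.2.10`, and `by_prefix` groups the three hosts that share the `192.0.2` prefix.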
-
20. The method of claim 1, wherein the analyzing step attempts to access selected pages on each host from both hosts to determine whether the hosts are mirrored hosts.
-
21. The method of claim 1, wherein the analyzing step further includes:
-
attempting to access pages from both hosts corresponding to paths selected from each of the hosts to determine whether the hosts are mirrored hosts;
categorizing the selected pages as having various matching categories; and
categorizing the hosts pairs in one of a plurality of similarity categories in accordance with the matching categories of the selected pages.
-
-
22. The method of claim 21, wherein the matching categories indicate one of:
- access of the path on the source host failed;
access of the path on the target host failed;
content is byte-wise identical;
documents are 100% similar after removal of HTML tags, whitespace, etc.;
common content above a threshold for high similarity (e.g., 50%); and
path is valid but no similarity.
-
23. The method of claim 21, wherein the host pairs are divided into five similarity categories in accordance with the matching categories of the selected pages.
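As a rough illustration of the page-level matching categories listed in claim 22, the sketch below classifies one pair of fetched pages. Several details are assumptions rather than anything stated in the claims: a failed fetch is represented by `None`, markup is stripped with a simple regular expression, and "common content" is approximated with the word-level ratio from Python's `difflib.SequenceMatcher`. How page categories roll up into the five host-pair categories of claim 23 is not spelled out in the claims, so it is left out.

```python
import re
from difflib import SequenceMatcher
from enum import Enum


class Match(Enum):
    SOURCE_FAILED = 1    # access of the path on the source host failed
    TARGET_FAILED = 2    # access of the path on the target host failed
    IDENTICAL = 3        # content is byte-wise identical
    SAME_TEXT = 4        # identical after removing tags and whitespace
    HIGH_SIMILARITY = 5  # common content above the threshold
    NO_SIMILARITY = 6    # path is valid but no similarity


def strip_markup(page: bytes) -> str:
    """Drop HTML tags and collapse whitespace (a deliberately crude cleaner)."""
    text = re.sub(r"<[^>]*>", " ", page.decode("utf-8", errors="replace"))
    return " ".join(text.split())


def classify(source, target, threshold=0.5):
    """Classify one path fetched from both hosts; None means the fetch failed."""
    if source is None:
        return Match.SOURCE_FAILED
    if target is None:
        return Match.TARGET_FAILED
    if source == target:
        return Match.IDENTICAL
    a, b = strip_markup(source), strip_markup(target)
    if a == b:
        return Match.SAME_TEXT
    # Assumption: word-level SequenceMatcher ratio stands in for "common content".
    if SequenceMatcher(None, a.split(), b.split()).ratio() > threshold:
        return Match.HIGH_SIMILARITY
    return Match.NO_SIMILARITY
```

For example, `classify(b"<p>hi  there</p>", b"<div>hi there</div>")` falls into the `SAME_TEXT` category because the two pages differ only in markup and spacing.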
-
11. A method of determining mirrored web hosts, comprising:
-
receiving information about the addresses of a plurality of web sites stored on a plurality of hosts and about page level connectivity information of the plurality of web sites and a list of potentially mirrored hosts pairs, wherein mirrored hosts contain highly similar documents within the same path; and
filtering the list of potential mirrored hosts pairs in accordance with the page level connectivity information, wherein mirrored hosts contain highly similar documents within the same path.
Dependent claims (12, 13, 14, 15)
selecting, for a host pair, 2*n page paths known to be present on both hosts in the host pair and having a high outdegree;
for each of the 2*n page paths, counting a percentage of outgoing links common to the two pages corresponding to a page path; and
determining that if the counted percentage is greater than a threshold, that the page paths “match.”
-
-
14. The method of claim 13, further including:
-
determining what percentage of the 2*n page paths for a particular host pair “match”; and
if the percentage is above a certain threshold, determining that the host pair represents potential mirrored hosts.
-
-
15. The method of claim 13, further including:
before determining a “match” between pages corresponding to a page path, if the page path points to one of the hosts in the host pair, removing the part of the URL for the host, resulting in a relative URL.
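The page-level connectivity filter of claims 13 to 15 might be sketched as follows. This is an illustration under assumptions, not the claimed method itself: the function names, both threshold defaults, and the use of intersection over union as the "percentage of outgoing links common to the two pages" are all hypothetical choices. Selecting the 2*n high-outdegree paths is assumed to have happened already, so the caller passes in the outgoing links for each sampled path.

```python
from urllib.parse import urlsplit


def relativize(link, host_pair):
    """Strip the host part from links that point at either host in the pair."""
    parts = urlsplit(link)
    if parts.hostname in host_pair:
        relative = parts.path or "/"
        if parts.query:
            relative += "?" + parts.query
        return relative
    return link


def common_link_fraction(links_a, links_b, host_pair):
    """Fraction of outgoing links shared by two pages (assumed: Jaccard overlap)."""
    a = {relativize(link, host_pair) for link in links_a}
    b = {relativize(link, host_pair) for link in links_b}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_is_candidate(outlinks_a, outlinks_b, host_pair,
                      link_threshold=0.5, path_threshold=0.5):
    """outlinks_a / outlinks_b: dict of page path -> list of outgoing URLs."""
    paths = outlinks_a.keys() & outlinks_b.keys()
    if not paths:
        return False
    matches = sum(
        common_link_fraction(outlinks_a[p], outlinks_b[p], host_pair) > link_threshold
        for p in paths
    )
    return matches / len(paths) > path_threshold


pair = ("a.example", "b.example")
links_a = {"/index.html": ["http://a.example/docs/", "http://other.example/"],
           "/faq.html": ["http://a.example/x"]}
links_b = {"/index.html": ["http://b.example/docs/", "http://other.example/"],
           "/faq.html": ["http://b.example/y"]}
```

In the sample data, `/index.html` matches once the host part is removed from the intra-site link, while `/faq.html` does not, so half of the sampled paths match.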
-
16. A method of determining mirrored web hosts, comprising:
-
receiving information about the addresses of a plurality of web sites stored on a plurality of hosts and about connectivity information of the plurality of web sites;
for each host, determining a set of terms for the host, indicating those hosts that are targets of incoming links from some page on the host;
for each term, determining the frequency, which equals the number of such incoming links;
for each host, selecting the terms with the highest frequency;
for each host, weighting the terms; and
using term vector matching to determine the likelihood of a pair of hosts being mirrors in accordance with the weighted terms of the pair of hosts.
Dependent claims (17, 18, 19)
using a weighting function based on an indegree of the term.
-
-
18. The method of claim 16, wherein weighting the term includes using a weighting function of:
if in(t) <= 25 then term-weight is 1, otherwise term-weight is MIN(1/5, 200/in(t)), where in(t) of a host t is the number of hosts that have links to it.
-
19. The method of claim 16, wherein weighting the term includes using a weighting function of:
if in(t) <= 25 then term-weight is 1, otherwise term-weight is MIN(1/5, 200/in(t)), where in(t) of a host t is the number of hosts that have links to it; and
further multiplying each term weight by 1+log(maxin/in(t)), where in(t) of a host t is the number of hosts that have links to it, where maxin is the highest value of in(t) amongst all terms of all hosts.
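The weighting function of claims 18 and 19 and the term vector comparison of claim 16 can be sketched as below. The piecewise weight and the 1+log(maxin/in(t)) factor follow the claim text; everything else is an assumption for illustration: the function names, the natural logarithm (the claims do not state a base), multiplying frequency by weight to form a vector entry, the `top_k` cutoff, and cosine similarity as the vector-matching measure.

```python
import math
from collections import Counter


def term_weight(indegree, max_indegree=None):
    """Weight for a term (a linked-to host) with the given host-level indegree."""
    weight = 1.0 if indegree <= 25 else min(1 / 5, 200 / indegree)
    if max_indegree is not None:
        # Extra factor from claim 19; natural log is an assumption.
        weight *= 1 + math.log(max_indegree / indegree)
    return weight


def host_vector(link_targets, indegree, top_k=10, max_indegree=None):
    """Build a weighted term vector from the hosts that one host links to.

    link_targets: one entry per outgoing link, naming the target host.
    indegree: dict of target host -> number of hosts linking to it.
    """
    top_terms = Counter(link_targets).most_common(top_k)
    # Assumption: a vector entry is frequency times term weight.
    return {t: f * term_weight(indegree[t], max_indegree) for t, f in top_terms}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


indegree = {"x.example": 10, "y.example": 100}
vec = host_vector(["x.example", "x.example", "y.example"], indegree, top_k=2)
```

Under this weighting, a target host with up to 25 linking hosts keeps full weight, one with between 26 and 1000 gets 1/5, and anything more popular is scaled down as 200/in(t), so very widely linked hosts contribute little to the comparison.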
-
24. An apparatus that determines mirrored web sites, comprising:
-
software configured to receive information about a plurality of web sites stored on a plurality of hosts;
software configured to determine a list of host pairs that are potentially mirrored hosts, wherein mirrored hosts contain highly similar documents within the same path; and
software configured to analyze the list of pairs of potential mirrored hosts to determine which of the host pairs are mirrored hosts.
Dependent claims (25)
software configured to receive information about the IP addresses of the plurality of web sites stored on the plurality of hosts;
software configured to determine clusters of hosts, where all web sites in a cluster have the same IP addresses; and
software configured to determine that the hosts in clusters of hosts having less than or equal to a threshold number of hosts therein are mirrored web hosts.
-
-
26. An apparatus that determines mirrored web hosts, comprising:
-
software configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts and about page level connectivity information of the plurality of web sites and a list of potentially mirrored hosts pairs, wherein mirrored hosts contain highly similar documents within the same path; and
software configured to filter the list of potential mirrored hosts pairs in accordance with the page level connectivity information.
-
-
27. An apparatus that determines mirrored web hosts, comprising:
-
software configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts and about connectivity information of the plurality of web sites;
software configured to, for each host, determine a set of terms for the host, indicating those hosts that are targets of incoming links from some page on the host;
software configured to, for each term, determine the frequency, which equals the number of such incoming links;
software configured to, for each host, select the terms with the highest frequency;
software configured to, for each host, weight the terms; and
software configured to use term vector matching to determine the likelihood of a pair of hosts being mirrors in accordance with the weighted terms of the pair of hosts.
-
-
28. A computer program product, comprising:
-
computer program code devices configured to receive information about a plurality of web sites stored on a plurality of hosts;
computer program code devices configured to determine a list of host pairs that are potentially mirrored hosts, wherein mirrored hosts contain highly similar documents within the same path; and
computer program code devices configured to analyze the list of pairs of potential mirrored hosts to determine which of the host pairs are mirrored hosts.
Dependent claims (29)
computer program code devices configured to receive information about the IP addresses of the plurality of web sites stored on the plurality of hosts;
computer program code devices configured to determine clusters of hosts, where all web sites in a cluster have the same IP addresses; and
computer program code devices configured to determine that the hosts in clusters of hosts having less than or equal to a threshold number of hosts therein are mirrored web hosts.
-
-
30. A computer program product, comprising:
-
computer program code devices configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts and about page level connectivity information of the plurality of web sites and a list of potentially mirrored hosts pairs, wherein mirrored hosts contain highly similar documents within the same path; and
computer program code devices configured to filter the list of potential mirrored hosts pairs in accordance with the page level connectivity information.
-
-
31. A computer program product, comprising:
-
computer program code devices configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts and about connectivity information of the plurality of web sites;
computer program code devices configured to, for each host, determine a set of terms for the host, indicating those hosts that are targets of incoming links from some page on the host;
computer program code devices configured to, for each term, determine the frequency, which equals the number of such incoming links;
computer program code devices configured to, for each host, select the terms with the highest frequency;
computer program code devices configured to, for each host, weight the terms; and
computer program code devices configured to use term vector matching to determine the likelihood of a pair of hosts being mirrors in accordance with the weighted terms of the pair of hosts.
-
Specification