Method and apparatus for finding mirrored hosts by analyzing URLs
Abstract
A method and apparatus that detect mirrored host pairs using information about a large set of pages, including URLs. The identities of the detected mirrored hosts are then saved so that browsers, crawlers, proxy servers, or the like can correctly identify mirrored web sites. The described embodiments of the present invention examine the URLs of pages on the hosts to determine whether the hosts are potentially mirrored.
116 Citations
21 Claims
1. A method of determining mirrored web hosts, comprising:
receiving information about the addresses of a plurality of web sites stored on a plurality of hosts;
determining a plurality of terms of the URLs associated with every host;
weighting the terms in inverse proportion to frequency;
determining a similarity score for host pair in accordance with the weighted terms; and
outputting a list of potential pairs of mirrored hosts in accordance with their similarity scores.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
attempting to access selected pages on each host of a host pair, the pages corresponding to paths selected from each host, to determine whether the hosts are mirrored hosts;
categorizing the pages as having various matching categories; and
categorizing the host pairs in one of a plurality of similarity categories in accordance with the matching categories of the selected pages.
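The four steps of claim 1 (collect URLs per host, extract terms, weight them inversely to frequency, score host pairs) can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer, the `log(1 + N/df)` weight, the cosine similarity measure, the `threshold` parameter, and the function names `url_terms` and `mirror_candidates` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations
from urllib.parse import urlsplit


def url_terms(url):
    """Split a URL's host name and path into lowercase alphanumeric terms."""
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


def mirror_candidates(urls, threshold=0.5):
    """Return (score, host_a, host_b) tuples, highest score first.

    Assumption: each host is represented by the set of terms seen in its
    URLs, and rarer terms count for more (an IDF-style weight).
    """
    # Step 1 and 2: group URLs by host and collect each host's terms.
    host_terms = {}
    for url in urls:
        host = urlsplit(url).netloc.lower()
        host_terms.setdefault(host, set()).update(url_terms(url))

    # Step 3: weight each term inversely to the number of hosts using it.
    n_hosts = len(host_terms)
    host_freq = Counter(t for terms in host_terms.values() for t in terms)
    weight = {t: math.log(1 + n_hosts / df) for t, df in host_freq.items()}

    def norm(terms):
        return math.sqrt(sum(weight[t] ** 2 for t in terms))

    # Step 4: cosine similarity of the weighted term sets for every host pair.
    scored = []
    for a, b in combinations(sorted(host_terms), 2):
        shared = host_terms[a] & host_terms[b]
        dot = sum(weight[t] ** 2 for t in shared)
        denom = norm(host_terms[a]) * norm(host_terms[b])
        score = dot / denom if denom else 0.0
        if score >= threshold:
            scored.append((score, a, b))

    # Step 5: output candidate pairs ordered by similarity score.
    return sorted(scored, reverse=True)
```

Scoring every pair is quadratic in the number of hosts, so this sketch only suits small inputs; a large crawl would need some way to restrict which pairs are compared, which the claims shown here do not describe.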
12. The method of claim 11, wherein the matching categories indicate one of:
access of the path on a source host failed;
access of the path on a target host failed;
content is byte-wise identical;
documents are 100% similar after removal of;
common content above a threshold for high similarity; and
path is valid but no similarity.
13. The method of claim 11, wherein the host pairs are divided into five similarity categories in accordance with the matching categories of the pages.
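The page-level matching categories of claim 12 can be sketched as a small classifier over already-fetched page contents. This is an illustrative sketch only: the `Match` enum, `classify_page`, `tally_host_pair`, the `high` threshold, and the token-based `difflib` ratio are assumptions. The listing above truncates what is removed before the "100% similar" comparison, so the sketch takes a caller-supplied, hypothetical `normalize` function rather than guessing. The five host-pair categories of claim 13 are not named here, so the sketch stops at a per-category tally.

```python
import difflib
from collections import Counter
from enum import Enum


class Match(Enum):
    SOURCE_FAILED = "access of the path on the source host failed"
    TARGET_FAILED = "access of the path on the target host failed"
    IDENTICAL = "content is byte-wise identical"
    SAME_AFTER_NORMALIZING = "documents are 100% similar after normalization"
    HIGH_SIMILARITY = "common content above the high-similarity threshold"
    NO_SIMILARITY = "path is valid but no similarity"


def classify_page(source, target, normalize, high=0.8):
    """Classify one path fetched from both hosts.

    source/target are the fetched bytes, or None if the access failed.
    normalize is a hypothetical caller-supplied bytes -> bytes function.
    """
    if source is None:
        return Match.SOURCE_FAILED
    if target is None:
        return Match.TARGET_FAILED
    if source == target:
        return Match.IDENTICAL
    a, b = normalize(source), normalize(target)
    if a == b:
        return Match.SAME_AFTER_NORMALIZING
    # Assumed similarity measure: matching-token ratio from difflib.
    ratio = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
    return Match.HIGH_SIMILARITY if ratio >= high else Match.NO_SIMILARITY


def tally_host_pair(page_pairs, normalize):
    """Count matching categories over the (source, target) pages of a host pair."""
    return Counter(classify_page(s, t, normalize) for s, t in page_pairs)
```

A host pair's tally could then be mapped onto similarity categories, for example by the share of paths that are identical or highly similar; that mapping is left out because the claim text shown here does not define it.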
14. An apparatus that determines mirrored web hosts, comprising:
software configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts;
software configured to determine a plurality of terms of the URLs associated with every host;
software configured to weight the terms in inverse proportion to frequency;
software configured to determine a similarity score for host pair in accordance with the weighted terms; and
software configured to output a list of potential pairs of mirrored hosts in accordance with their similarity scores.
Dependent claims: 15, 16, 17, 18, 19, 20
21. A computer program product, comprising:
computer program code devices configured to receive information about the addresses of a plurality of web sites stored on a plurality of hosts;
computer program code devices configured to determine a plurality of terms of the URLs associated with every host;
computer program code devices configured to weight the terms in inverse proportion to frequency;
computer program code devices configured to determine a similarity score for host pair in accordance with the weighted terms; and
computer program code devices configured to output a list of potential pairs of mirrored hosts in accordance with their similarity scores.
Specification