Using hash signatures of DOM objects to identify website similarity
Abstract
Embodiments are directed to using a hash signature of a rendered DOM object of a website to find similar content and behavior on other websites. Embodiments break a DOM into a large number of data portions (i.e., “shingles”), apply a hashing algorithm to the shingles, select a predetermined number of hashes from the hashed shingles according to selection criteria to create a hash signature, and compare the hash signature to that of a reference page to determine similarity of website DOM object content. Embodiments can be used to identify phishing websites, defaced websites, spam websites, significant changes in the content of a webpage, and copyright infringement, and for any other suitable purpose related to the similarity between website DOM object content.
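The pipeline summarized in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the DOM is assumed to be already rendered and serialized to a string, shingles are taken as overlapping 8-character windows, SHA-256 truncated to 64 bits stands in for "a hashing algorithm", and "keep the smallest hashes" stands in for the selection criteria. The function names (`shingles`, `signature`, `resemblance`) and the default sizes are hypothetical.

```python
import hashlib
import heapq


def shingles(dom_text: str, length: int = 8) -> set[str]:
    """Split serialized DOM text into overlapping fixed-length shingles (assumed scheme)."""
    if len(dom_text) <= length:
        return {dom_text}
    return {dom_text[i:i + length] for i in range(len(dom_text) - length + 1)}


def hash_shingle(shingle: str) -> int:
    """Map a shingle to a 64-bit integer (truncated SHA-256, an illustrative choice)."""
    digest = hashlib.sha256(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(dom_text: str, size: int = 64, length: int = 8) -> list[int]:
    """Keep the `size` smallest shingle hashes, one possible selection criterion."""
    hashes = {hash_shingle(s) for s in shingles(dom_text, length)}
    return heapq.nsmallest(size, hashes)


def resemblance(sig_a: list[int], sig_b: list[int], size: int = 64) -> float:
    """Estimate Jaccard similarity of the underlying shingle sets from two signatures."""
    a, b = set(sig_a), set(sig_b)
    # Look only at the smallest hashes of the combined signatures,
    # then count how many of them appear in both signatures.
    union_smallest = heapq.nsmallest(size, a | b)
    if not union_smallest:
        return 0.0
    shared = sum(1 for h in union_smallest if h in a and h in b)
    return shared / len(union_smallest)
```

A signature computed for a candidate page would be compared against the stored signature of a reference page; a resemblance near 1.0 suggests nearly identical DOM content, while unrelated pages score near 0.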
20 Claims
1. A method for determining a similarity between two websites, the method comprising, at a computer system:
receiving website information from a web server corresponding to a website;
rendering a document object model (DOM) object of the website using the website information;
separating content within the DOM object into a plurality of data portions, each of the plurality of data portions having a fixed length;
generating, by a hardware processor of the computer system, a hash signature of the DOM object by:
applying a predetermined number of hashing functions to each of the plurality of data portions, wherein the predetermined number of hashing functions are generated using a common seed value, and wherein applying the predetermined number of hashing functions results in a predetermined number of values for each of the plurality of data portions; and
selecting, using a selection policy, a predetermined number of hashed data portions of the plurality of hashed data portions, wherein the predetermined number of hashed data portions are selected to create a hash signature of the DOM object;
comparing the hash signature of the DOM object to a known hash signature of a DOM object associated with a known website having a first classification, wherein comparing the hash signature of the DOM object to the known hash signature of the DOM object associated with the known website includes comparing each of the plurality of hashed data portions to a plurality of known hashed data portions of the known hash signature;
calculating a similarity measurement between the hash signature of the DOM object and the known hash signature of the DOM object associated with the known website;
comparing the similarity measurement to a threshold; and
determining that the website has the first classification based on the similarity measurement exceeding the threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A computer comprising:
a processor; and
a computer product coupled to the processor, the computer product comprising code, executable by the processor, to provide a computer program configured to perform a method of determining a similarity between two websites, the method comprising:
receiving website information from a web server corresponding to a website;
rendering a document object model (DOM) object of the website using the website information;
separating content within the DOM object into a plurality of data portions, each of the plurality of data portions having a fixed length;
generating a hash signature of the DOM object by:
applying a predetermined number of hashing functions to each of the plurality of data portions, wherein the predetermined number of hashing functions are generated using a common seed value, and wherein applying the predetermined number of hashing functions results in a predetermined number of values for each of the plurality of data portions; and
selecting, using a selection policy, a predetermined number of hashed data portions of the plurality of hashed data portions, wherein the predetermined number of hashed data portions are selected to create a hash signature of the DOM object;
comparing the hash signature of the DOM object to a known hash signature of a DOM object associated with a known website having a first classification, wherein comparing the hash signature of the DOM object to the known hash signature of the DOM object associated with the known website includes comparing each of the plurality of hashed data portions to a plurality of known hashed data portions of the known hash signature;
calculating a similarity measurement between the hash signature of the DOM object and the known hash signature of the DOM object associated with the known website;
comparing the similarity measurement to a threshold; and
determining that the website has the first classification based on the similarity measurement exceeding the threshold.
Dependent claims: 15, 16, 17
18. A system comprising:
a web server computer configured to serve website information associated with a website; and
a similarity analysis computer communicatively coupled to the web server computer through a network connection, the similarity analysis computer configured to:
receive website information from the web server computer corresponding to the website;
render a document object model (DOM) object of the website using the website information;
separate content within the DOM object into a plurality of data portions, each of the plurality of data portions having a fixed length;
generate a hash signature of the DOM object by:
applying a predetermined number of hashing functions to each of the plurality of data portions, wherein the predetermined number of hashing functions are generated using a common seed value, and wherein applying the predetermined number of hashing functions results in a predetermined number of values for each of the plurality of data portions; and
selecting, using a selection policy, a predetermined number of hashed data portions of the plurality of hashed data portions, wherein the predetermined number of hashed data portions are selected to create a hash signature of the DOM object;
compare the hash signature of the DOM object to a known hash signature of a DOM object associated with a known website having a first classification, wherein comparing the hash signature of the DOM object to the known hash signature of the DOM object associated with the known website includes comparing each of the plurality of hashed data portions to a plurality of known hashed data portions of the known hash signature;
calculate a similarity measurement between the hash signature of the DOM object and the known hash signature of the DOM object associated with the known website;
compare the similarity measurement to a threshold; and
determine that the website has the first classification based on the similarity measurement exceeding the threshold.
Dependent claims: 19, 20
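The hashing and classification steps recited in the independent claims resemble a MinHash scheme, and the following Python sketch shows one way they could fit together. It is an assumption-laden illustration rather than the claimed implementation: the family of hashing functions is derived from a common seed by drawing per-function salts for BLAKE2b, the selection policy is taken to be "minimum value per hashing function", and the similarity measurement is the fraction of matching signature positions. All names, the seed, the signature length of 32, and the 0.8 threshold are hypothetical.

```python
import hashlib
import random


def make_hash_functions(count: int, seed: int):
    """Derive `count` hash functions from one common seed (assumed construction)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64).to_bytes(8, "big") for _ in range(count)]

    def make(salt: bytes):
        def h(portion: str) -> int:
            digest = hashlib.blake2b(
                portion.encode("utf-8"), digest_size=8, salt=salt
            ).digest()
            return int.from_bytes(digest, "big")
        return h

    return [make(salt) for salt in salts]


def fixed_length_portions(content: str, length: int = 8) -> list[str]:
    """Separate content into fixed-length portions (overlapping windows assumed)."""
    if len(content) <= length:
        return [content]
    return [content[i:i + length] for i in range(len(content) - length + 1)]


def minhash_signature(content: str, hash_functions, length: int = 8) -> list[int]:
    """Selection policy (assumed): keep the minimum hash value per hash function."""
    portions = fixed_length_portions(content, length)
    return [min(h(p) for p in portions) for h in hash_functions]


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def classify(candidate_sig, known_sig, known_label: str, threshold: float = 0.8):
    """Return the known site's classification if similarity exceeds the threshold."""
    return known_label if similarity(candidate_sig, known_sig) > threshold else None
```

Because every hashing function is derived from the same seed, signatures computed at different times or on different machines remain comparable as long as the seed and function count are unchanged.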
Specification