Syntactical Fingerprinting
Abstract
A method for identifying phishing websites and tracing the provenance of each website through the structural components that compose it. The method identifies newly observed phishing websites and also serves as a distance metric for clustering phishing websites. Varying the threshold value within the method can help phishing investigators identify the source of many phishing websites, as well as individual phishers.
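To illustrate how a similarity score can double as a distance metric for clustering, and how the threshold changes the grouping, here is a minimal sketch. It assumes each website has already been reduced to a set of component hash values. The Jaccard coefficient, the single-link grouping, and the names `jaccard` and `cluster` are illustrative assumptions; the abstract does not name a specific metric or clustering algorithm.

```python
def jaccard(a, b):
    """Assumed similarity score: shared hashes divided by all distinct hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(sites, threshold):
    """Group sites whose pairwise similarity meets the threshold (single link).

    `sites` maps a site id to its set of component hash values.
    Equivalent to linking sites whose distance (1 - similarity) is at most
    1 - threshold.
    """
    ids = list(sites)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(sites[a], sites[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical hash sets (small integers stand in for real hash values).
sites = {
    "s1": {1, 2, 3, 4},
    "s2": {1, 2, 3, 5},
    "s3": {1, 9, 10, 11},
    "s4": {20},
}
strict = cluster(sites, 0.5)
loose = cluster(sites, 0.1)
```

In this toy data a stricter threshold keeps only the near-identical pair together, while a looser one also pulls in a site that shares a single component, which is the kind of trade-off an investigator would tune.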
20 Claims
1. A method for identifying a phishing website comprising:
a. providing a computer system having an operating system, a database system and a communication system for controlling communications through the Internet,
b. transmitting a communication containing a plurality of suspected phishing urls to the computer system,
c. retrieving website content files for each suspected phishing url of the plurality of phishing urls, the website content files including structural components,
d. preprocessing the website content files thereby producing normalized website content file sets for each of the plurality of suspected phishing urls,
e. creating an abstract syntax tree for each of the normalized website content file sets,
f. calculating a hash value for each structural component of each of the normalized website content file sets and constructing a hash value set there from for each normalized website content file set,
g. selecting a first hash value from a first hash value set and comparing the first hash value to hash values of structural components of known phishing websites to locate a matching hash value,
h. if a matching hash value is located, comparing the first hash value set to a hash value set of the matching hash value and creating a similarity score, and
i. if the similarity score meets or exceeds a predetermined threshold, designating a suspected url from which the first hash value was derived as a phishing website.
Dependent claims: 2-16
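Steps e and f of claim 1 (building a syntax tree and hashing each structural component) can be sketched as follows. This is a minimal illustration, not the patented implementation: the choice of which tags count as "structural components" (`component_tags`), the use of SHA-256, and the names `Node`, `TreeBuilder`, `serialize`, and `component_hashes` are all assumptions. The tree is built with Python's standard `html.parser` module as a stand-in for the abstract syntax tree.

```python
import hashlib
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        # Sort attributes so attribute order does not change the hash.
        self.attrs = sorted((k, v or "") for k, v in attrs)
        self.children = []  # child Node objects or text strings


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def serialize(node):
    """Canonical text form of a subtree, used as the hash input."""
    if isinstance(node, str):
        return node
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs)
    inner = "".join(serialize(c) for c in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def component_hashes(html, component_tags=("form", "script", "style", "table", "div")):
    """Return the set of hashes of every subtree rooted at an assumed component tag."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    hashes = set()

    def walk(node):
        if isinstance(node, str):
            return
        if node.tag in component_tags:
            digest = hashlib.sha256(serialize(node).encode("utf-8")).hexdigest()
            hashes.add(digest)
        for child in node.children:
            walk(child)

    walk(builder.root)
    return hashes


page_a = '<html><body><form action="x"><input name="u"></form><div>A</div></body></html>'
page_b = '<html><body><div>B</div><form action="x"><input name="u"></form></body></html>'
shared = component_hashes(page_a) & component_hashes(page_b)
```

Two pages that reuse the same login form but differ elsewhere end up sharing exactly one hash value here, which is the signal that steps g through i build on.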

17. A method for identifying a phishing website comprising:
a. receiving a communication containing a plurality of suspected phishing urls,
b. retrieving website content files for each suspected phishing url of the plurality of phishing urls, the website content files including structural components,
c. creating an abstract syntax tree for each of the website content files,
d. calculating a hash value for each structural component of each of the website content files and constructing a hash value set there from for each website content file set,
e. selecting a first hash value from a first hash value set and comparing the first hash value to hash values of structural components of known phishing websites to locate a matching hash value,
f. if a matching hash value is located, comparing the first hash value set to a hash value set of the matching hash value and creating a similarity score, and
g. if the similarity score meets or exceeds a predetermined threshold, designating a suspected url from which the first hash value was derived as a phishing website.
Dependent claims: 18
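Steps e through g of claim 17 (locate a matching hash, compare the full hash value sets, apply the threshold) can be sketched with an inverted index from hash value to known phishing sites. This is one possible arrangement under stated assumptions: the claim does not say how the lookup is stored or which similarity measure is used, so the index structure, the Jaccard coefficient, the default threshold of 0.5, and the names `build_index`, `jaccard`, and `classify` are hypothetical.

```python
from collections import defaultdict


def build_index(known_sites):
    """Map each component hash to the ids of known phishing sites containing it."""
    index = defaultdict(set)
    for site_id, hashes in known_sites.items():
        for h in hashes:
            index[h].add(site_id)
    return index


def jaccard(a, b):
    """Assumed similarity score: shared hashes divided by all distinct hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(candidate_hashes, known_sites, index, threshold=0.5):
    """Return (site_id, score) of the best match if it meets the threshold,
    otherwise (None, best_score)."""
    best_id, best_score = None, 0.0
    compared = set()
    for h in candidate_hashes:
        # Only sites sharing at least one hash are ever compared in full.
        for site_id in index.get(h, ()):
            if site_id in compared:
                continue
            compared.add(site_id)
            score = jaccard(candidate_hashes, known_sites[site_id])
            if score > best_score:
                best_id, best_score = site_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Hypothetical data: short strings stand in for real hash values.
known = {"kit-a": {"h1", "h2", "h3"}, "kit-b": {"h9"}}
index = build_index(known)
match = classify({"h1", "h2", "h4"}, known, index)
no_overlap = classify({"h7"}, known, index)
weak = classify({"h1", "h5", "h6", "h7"}, known, index)
```

The index means a candidate with no hash in common with any known site is rejected without computing a single set comparison, which mirrors the claim's ordering of "locate a matching hash value" before "creating a similarity score".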

19. A method for identifying a phishing website comprising:
a. providing a computer system having an operating system, a database system and a communication system for controlling communications through the Internet,
b. transmitting a communication containing a plurality of suspected phishing urls to the computer system,
c. prior to step d. removing from the plurality of suspected phishing urls any suspected phishing urls that are known benign urls, known phishing urls, or urls that are duplicates of another suspected phishing url in the plurality of suspected phishing urls,
d. retrieving website content files for each suspected phishing url of the plurality of phishing urls, wherein the website content files include structural components and are derived from index pages of the retrieved website content files,
e. preprocessing the website content files thereby producing normalized website content file sets for each of the plurality of suspected phishing urls, wherein preprocessing includes one or more of removing white space from the website content files, making the website content files case insensitive or removing dynamic content from the website content files,
f. creating an abstract syntax tree for each of the normalized website content file sets, wherein creating the abstract syntax tree includes parsing HTML tags within the normalized website content file sets and constructing the abstract syntax tree of HTML entities,
g. calculating a hash value for each structural component of each of the normalized website content file sets and constructing a hash value set there from for each normalized website content file set,
h. selecting a first hash value from a first hash value set and comparing the first hash value to hash values of structural components of known phishing websites to locate a matching hash value,
i. if a matching hash value is located, comparing the first hash value set to a hash value set of the matching hash value and creating a similarity score, and
j. if the similarity score meets or exceeds a predetermined threshold, designating a suspected url from which the first hash value was derived as a phishing website.
Dependent claims: 20
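The URL filtering of step c and the preprocessing of step e in claim 19 can be sketched as below. Several details are assumptions made for illustration only: "removing white space" is interpreted as dropping whitespace between tags and collapsing other runs to a single space (so that tags remain parseable in step f), "case insensitive" is implemented by lowercasing, and "dynamic content" is approximated by a hypothetical pattern that strips long hexadecimal tokens such as session identifiers. The names `filter_urls`, `normalize`, and `DYNAMIC_PATTERNS` are not from the patent.

```python
import re


def filter_urls(suspected, known_benign, known_phishing):
    """Drop known benign urls, known phishing urls, and duplicates, keeping order."""
    seen = set()
    kept = []
    for url in suspected:
        if url in known_benign or url in known_phishing or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


# Assumption: long hex runs (16+ characters) are per-request tokens.
DYNAMIC_PATTERNS = [re.compile(r"\b[0-9a-f]{16,}\b")]
BETWEEN_TAGS_RE = re.compile(r">\s+<")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(content, dynamic_patterns=DYNAMIC_PATTERNS):
    """Return a normalized copy of a website content file."""
    content = content.lower()                     # case insensitivity
    for pattern in dynamic_patterns:              # remove dynamic content
        content = pattern.sub("", content)
    content = BETWEEN_TAGS_RE.sub("><", content)  # whitespace between tags
    content = WHITESPACE_RE.sub(" ", content)     # collapse remaining runs
    return content.strip()


urls = filter_urls(
    ["http://a.example/x", "http://b.example/", "http://a.example/x", "http://c.example/"],
    known_benign={"http://b.example/"},
    known_phishing={"http://c.example/"},
)
v1 = normalize('<HTML>\n  <Body>\n <FORM  action="Login.php?sid=0123456789ABCDEF0123">\n</FORM></Body></HTML>')
v2 = normalize('<html><body><form action="login.php?sid=FEDCBA98765432100000"></form></body></html>')
```

Two copies of the same page that differ only in letter case, indentation, and a session token normalize to the same string here, so their component hashes would also agree in the later steps.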
Specification