Phylogeny generation
First Claim
1. A method for extracting features from programs comprising:
- (1) providing a processor and a receiving device;
(2) receiving a pair of programs in said processor from said receiving device;
(3) using said processor to extract one or more tokens from each said program; and
(4) using said processor to extract one or more features for each said program based on said extracted tokens, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens;
using said processor to declare as the closest match between said unknown program and one or more of said collection programs the pairing of programs that yields the highest similarity score; and
using said processor to classify said unknown program in the same class as the program that was the said closest match; and
using said processor to generate a graph using said similarity score, said graph comprising a phylogeny tree.
0 Assignments
0 Petitions
Accused Products
Abstract
A method is provided for comparing malware or other types of computer programs, and for optionally using such a comparison method for (a) searching for matching programs in a collection of programs, (b) classifying programs, and (c) constructing a classification or a partitioning within a collection of programs. In general, there are three steps to the comparison portion: selecting and extracting tokens from a pair of programs for comparison, building features from these tokens, and comparing the programs based on the frequency of feature occurrences to produce a similarity measure. Pairwise similarity is then used for optionally searching, classifying, or constructing classification systems.
75 Citations
14 Claims
-
1. A method for extracting features from programs comprising:
-
(1) providing a processor and a receiving device; (2) receiving a pair of programs in said processor from said receiving device; (3) using said processor to extract one or more tokens from each said program; and (4) using said processor to extract one or more features for each said program based on said extracted tokens, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens;
using said processor to declare as the closest match between said unknown program and one or more of said collection programs the pairing of programs that yields the highest similarity score; and
using said processor to classify said unknown program in the same class as the program that was the said closest match; and
using said processor to generate a graph using said similarity score, said graph comprising a phylogeny tree. - View Dependent Claims (2, 3, 4)
-
-
5. A method for searching for matches to an unknown program against a collection of programs, said searching for matches being performed by comparing said unknown program against one or more programs in said collection of programs;
- said comparison comprising the following steps;
(1) providing a processor and a receiving device; (2) receiving said unknown program and said one or more programs from said collection of programs in said processor from said receiving device; (3) using said processor to extract one or more tokens from said unknown program and said programs in said collection of programs; (4) using said processor to extract one or more features for each said program based on said extracted tokens, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens; (5) using said processor to construct feature frequency vectors for each said program based on the extracted features; (6) using said processor to calculate a similarity score for the pairing of said unknown program with each said program in said collection of programs by applying a similarity function to the feature frequency vectors extracted for each said program; and (7) using said processor to declare a match between said unknown program and one or more of said programs in said collection of programs if said similarity score is at least a given minimum stated score. - View Dependent Claims (6)
- said comparison comprising the following steps;
-
7. A method for classifying an unknown program based on a collection of programs that have been partitioned into classes of programs, said method comprising:
-
(1) providing a processor and a receiving device; (2) receiving said unknown program and said collection of programs in said processor from said receiving device; (3) using said processor to extract one or more tokens from said unknown program and said programs in said collection of programs; (4) using said processor to extract one or more features for each said program based on said extracted tokens, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens; (5) using said processor to construct feature frequency vectors for each said program based on the extracted features; (6) using said processor to calculate a similarity score for the pairing of said unknown program with each said program in said collection of programs by applying a similarity function to the feature frequency vectors extracted for each said program; (7) using said processor to declare as the closest match between said unknown program and one or more of said collection programs the pairing of programs that yields the highest similarity score; and (8) using said processor to classify said unknown program in the same class as the program that was the said closest match. - View Dependent Claims (8)
-
-
9. A method for partitioning a collection of programs, said method comprising:
-
(1) calculating the pair-wise similarity of all said programs within said collection of programs by the following method; (a) providing a processor and a receiving device; (b) receiving said collection of programs in said processor from said receiving device; (c) using said processor to extract one or more tokens from each said program; (d) using said processor to extract one or more features for each said program based on said extracted tokens, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens; (e) using said processor to construct feature frequency vectors for each said program based on the extracted features; and (f) using said processor to calculate a similarity score for said pair of said programs by applying a similarity function to the feature frequency vectors extracted for each said program; and (2) within said processor, using said pair-wise similarity scores as a basis for partitioning said programs in said collection of programs into classes of programs. - View Dependent Claims (10, 11)
-
-
12. A method for building a phylogeny tree for categorizing software programs, said method comprising:
-
(1) providing a processor and a receiving device; (2) receiving said software programs to be categorized in said processor from said receiving device; (3) using said processor to extract features of said programs, wherein said extracted features are constructed so that two sequences of successive n tokens are considered the same feature if they are permutations of the same n tokens; (4) using said processor to compare extracted features of one of said program with the extracted features of at least one other of said programs to ascertain the degree of similarity of said extracted features of said programs, said similarity being expressed as a similarity score; and (5) using said processor to generate a graph using said similarity score, said graph comprising a phylogeny tree;
said extraction step is performed as a full disassembly, linear sweep extraction; and
when said opcodes are extracted, parameters and prefixes are discarded. - View Dependent Claims (13, 14)
-
Specification