Encoding machine code instructions for static feature based malware clustering
Abstract
Machine language instruction sequences of computer files are extracted and encoded into standardized opcode sequences. The standardized opcodes in the sequences are of the same length and do not include operands. A multi-dimension vector is generated as a static feature for each computer file, where each element in the vector corresponds to the number of occurrences of a unique N-gram (i.e., unique sequence of N consecutive standardized opcodes) in the standardized opcode sequence for that computer file. The computer files are clustered into clusters of similarly classified files based on similarities of their static features. An unknown computer file can be classified by first grouping the file into a cluster of files with similar static features (e.g., into the cluster with the shortest average distance), and then determining the classification of that file based on the classifications of other files that belong to the same cluster.
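The feature-generation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patent's actual encoding: the fixed width `OPCODE_WIDTH`, the zero-padding scheme, the `(opcode, operands)` tuple input format, and the helper names `standardize`, `ngram_counts`, and `to_vector` are all assumptions made for the example, and the instruction sequence is presumed to have been disassembled already.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical fixed width (in bytes) shared by every standardized opcode.
OPCODE_WIDTH = 3


def standardize(instructions: Sequence[Tuple[bytes, bytes]]) -> List[bytes]:
    """Drop the operands and zero-pad each opcode to OPCODE_WIDTH bytes.

    The padding scheme is an assumption for illustration only.
    """
    standardized = []
    for opcode, _operands in instructions:
        if len(opcode) > OPCODE_WIDTH:
            raise ValueError(f"opcode longer than {OPCODE_WIDTH} bytes: {opcode.hex()}")
        standardized.append(opcode.rjust(OPCODE_WIDTH, b"\x00"))
    return standardized


def ngram_counts(opcodes: Sequence[bytes], n: int) -> Counter:
    """Count each unique run of n consecutive standardized opcodes."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def to_vector(counts: Counter, vocabulary: Sequence[tuple]) -> List[int]:
    """Build the static feature: one occurrence count per N-gram in the vocabulary."""
    return [counts[gram] for gram in vocabulary]


# Illustrative x86 instructions as (opcode bytes, operand bytes) pairs.
instructions = [
    (b"\x55", b""),                      # push ebp
    (b"\x8b", b"\xec"),                  # mov ebp, esp
    (b"\x0f\x84", b"\x10\x00\x00\x00"),  # je rel32 (two-byte opcode)
    (b"\x55", b""),
    (b"\x8b", b"\xec"),
]
sequence = standardize(instructions)
counts = ngram_counts(sequence, n=2)
vocabulary = sorted(counts)  # in practice the vocabulary would span all files
vector = to_vector(counts, vocabulary)
```

In a real system the vocabulary would be fixed across the whole corpus so that vectors from different files are comparable element by element; here it is derived from a single file only to keep the sketch short.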
17 Claims
1. A computer-implemented method for detecting malware, comprising:
extracting a machine language instruction sequence from a computer file of unknown classification, wherein the machine language instruction sequence comprises opcodes of different length and associated operand values;
encoding the opcodes in the machine language instruction sequence into a standardized opcode sequence, the standardized opcode sequence including the opcodes without the associated operand values, wherein the opcodes in the standardized opcode sequence have a uniform length;
generating a static feature for the computer file based on the standardized opcode sequence, wherein the static feature comprises a vector describing the standardized opcode sequence; and
classifying the computer file as malware based at least in part on the static feature, wherein the classifying comprises:
grouping the computer file into a cluster of computer files having similar vectors; and
classifying the computer file as malware based on classifications of the computer files in the cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

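The grouping and classifying steps of claim 1 might look like the sketch below, which follows the abstract's example of choosing the cluster with the shortest average distance. Several details are assumptions not stated in the source: Euclidean distance as the similarity measure, a majority vote over the cluster members' labels as the way to derive the classification, non-empty clusters, and the names `average_distance` and `classify`.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Member = Tuple[Vector, str]  # (static feature vector, known classification)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(vector: Vector, members: List[Member]) -> float:
    """Mean distance from the vector to every member of a non-empty cluster."""
    return sum(euclidean(vector, v) for v, _label in members) / len(members)


def classify(vector: Vector, clusters: Dict[str, List[Member]]) -> Tuple[str, str]:
    """Return (cluster name, classification) for an unknown file's vector.

    The file is grouped into the cluster with the shortest average distance,
    then takes the most common label in that cluster (majority vote is an
    assumption, not the patent's stated rule).
    """
    nearest = min(clusters, key=lambda name: average_distance(vector, clusters[name]))
    votes = Counter(label for _v, label in clusters[nearest])
    return nearest, votes.most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors with known classifications.
clusters = {
    "a": [([0, 0], "benign"), ([1, 0], "benign")],
    "b": [([10, 10], "malware"), ([11, 10], "malware"), ([10, 11], "benign")],
}
result = classify([10, 10.5], clusters)
```

Real feature vectors would have one dimension per N-gram rather than two, but the grouping logic is unchanged.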
11. A computer system for detecting malware, comprising:
a non-transitory computer-readable storage medium storing executable computer program code for:
extracting a machine language instruction sequence from a computer file of unknown classification, wherein the machine language instruction sequence comprises at least two opcodes of different length and associated operand values;
encoding the opcodes in the machine language instruction sequence into a standardized opcode sequence, the standardized opcode sequence including the opcodes without the associated operand values, wherein the opcodes in the standardized opcode sequence have a uniform length;
generating a static feature for the computer file based on the standardized opcode sequence, wherein the static feature comprises a vector describing the standardized opcode sequence; and
classifying the computer file as malware based at least in part on the static feature, wherein the classifying comprises:
grouping the computer file into a cluster of computer files having similar vectors; and
classifying the computer file as malware based on classifications of the computer files in the cluster; and
a processor for executing the computer program code.
Dependent claims: 12, 13, 14

15. A non-transitory computer-readable storage medium encoded with executable computer program code for detecting malware, the computer program code comprising program code for:
extracting a machine language instruction sequence from a computer file of unknown classification, wherein the machine language instruction sequence comprises opcodes of different length and associated operand values;
encoding the opcodes in the machine language instruction sequence into a standardized opcode sequence, the standardized opcode sequence including the opcodes without the associated operand values, wherein the opcodes in the standardized opcode sequence have a uniform length;
generating a static feature for the computer file based on the standardized opcode sequence, wherein the static feature comprises a vector describing the standardized opcode sequence; and
classifying the computer file as malware based at least in part on the static feature, wherein the classifying comprises:
grouping the computer file into a cluster of computer files having similar vectors; and
classifying the computer file as malware based on classifications of the computer files in the cluster.
Dependent claims: 16, 17

Specification