Generation and use of trained file classifiers for malware detection
First Claim
1. A computing device comprising:
a memory configured to store instructions to generate a file classifier; and
a processor configured to execute the instructions from the memory to perform operations comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a sequence of entropy indicators for each of the multiple files, each entropy indicator of the sequence of entropy indicators for the particular file corresponding to a chunk of the particular file;
generating n-gram vectors for the multiple files, wherein an n-gram vector for the particular file indicates occurrences of groups of entropy indicators in the sequence of entropy indicators for the particular file; and
generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data, wherein the supervised training data includes a plurality of n-gram vectors for each file, the plurality of n-gram vectors for at least one file including a zero-skip n-gram vector indicating occurrences of groups of adjacent entropy indicators, and including at least one skip n-gram vector indicating occurrences of groups of non-adjacent entropy indicators.
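The feature-extraction steps recited in the claim can be illustrated with a minimal Python sketch. The claim does not fix a chunk size, an entropy formula, a binning scheme, or a value of n, so everything here is an assumption for illustration: 256-byte chunks, Shannon entropy in bits per byte, 8 equal-width bins as the "entropy indicators", n = 2 (bigrams), and occurrence counts rather than presence flags. The function names `chunk_entropy`, `entropy_indicators`, and `skip_bigram_vector` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of one chunk, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_indicators(data: bytes, chunk_size: int = 256, bins: int = 8) -> List[int]:
    """One integer indicator in range(bins) per chunk of the file (assumed binning)."""
    indicators = []
    for start in range(0, len(data), chunk_size):
        h = chunk_entropy(data[start:start + chunk_size])
        # Equal-width bins over [0, 8]; the top edge is clamped into the last bin.
        indicators.append(min(int(h * bins / 8.0), bins - 1))
    return indicators


def skip_bigram_vector(indicators: Sequence[int], skip: int = 0, bins: int = 8) -> List[int]:
    """Count pairs (a, b) where b follows a with `skip` indicators in between.

    skip=0 gives the zero-skip vector (adjacent indicators);
    skip>=1 gives a skip vector (non-adjacent indicators).
    """
    vector = [0] * (bins * bins)
    gap = skip + 1
    for i in range(len(indicators) - gap):
        a, b = indicators[i], indicators[i + gap]
        vector[a * bins + b] += 1
    return vector


if __name__ == "__main__":
    # Toy "file": a constant chunk, a maximally varied chunk, a constant chunk.
    data = bytes(256) + bytes(range(256)) + bytes(256)
    seq = entropy_indicators(data)
    print(seq)
    print(sum(skip_bigram_vector(seq, skip=0)), sum(skip_bigram_vector(seq, skip=1)))
```

Producing one vector per skip value gives each file the "plurality of n-gram vectors" the claim refers to: the zero-skip vector captures transitions between neighbouring chunks, while a skip vector captures the same kind of transition across a gap.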
Abstract
A method includes accessing information identifying multiple files and identifying classification data for the multiple files, where the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware. The method also includes generating a sequence of entropy indicators for each of the multiple files, each entropy indicator of the sequence of entropy indicators for the particular file corresponding to a chunk of the particular file. The method further includes generating n-gram vectors for the multiple files, where the n-gram vector for the particular file indicates occurrences of groups of entropy indicators in the sequence of entropy indicators for the particular file. The method also includes generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data.
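The final step of the abstract, generating and storing a classifier from n-gram vectors and malware labels, can be sketched as follows. The abstract does not say which learning algorithm or storage format is used, so this is only an assumed stand-in: a plain perceptron written with the standard library and saved as JSON. The helpers `flatten`, `train_perceptron`, `predict`, `save_classifier`, and `load_classifier` are hypothetical names, and the label convention (1 = malware, 0 = benign) is an assumption.

```python
import json
from typing import Dict, List, Sequence


def flatten(vectors: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate one file's n-gram vectors into a single feature list."""
    return [value for vector in vectors for value in vector]


def train_perceptron(features: Sequence[Sequence[int]],
                     labels: Sequence[int],
                     epochs: int = 20) -> Dict[str, object]:
    """Supervised training on (feature list, label) pairs; label 1 = malware (assumed)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = y - (1 if score > 0 else 0)
            if error:
                # Standard perceptron update: move the boundary toward the missed example.
                weights = [w + error * v for w, v in zip(weights, x)]
                bias += error
    return {"weights": weights, "bias": bias}


def predict(model: Dict[str, object], x: Sequence[int]) -> int:
    """Return 1 (malware) or 0 (benign) for one feature list."""
    score = sum(w * v for w, v in zip(model["weights"], x)) + model["bias"]
    return 1 if score > 0 else 0


def save_classifier(model: Dict[str, object], path: str) -> None:
    """Store the trained classifier as JSON (assumed storage format)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle)


def load_classifier(path: str) -> Dict[str, object]:
    """Read a classifier previously written by save_classifier."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

In practice the feature lists would come from the n-gram vectors of labeled files, and a stronger model could take the perceptron's place without changing the overall flow of label, vectorize, train, and store.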
19 Claims
1. A computing device comprising:
a memory configured to store instructions to generate a file classifier; and
a processor configured to execute the instructions from the memory to perform operations comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a sequence of entropy indicators for each of the multiple files, each entropy indicator of the sequence of entropy indicators for the particular file corresponding to a chunk of the particular file;
generating n-gram vectors for the multiple files, wherein an n-gram vector for the particular file indicates occurrences of groups of entropy indicators in the sequence of entropy indicators for the particular file; and
generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data, wherein the supervised training data includes a plurality of n-gram vectors for each file, the plurality of n-gram vectors for at least one file including a zero-skip n-gram vector indicating occurrences of groups of adjacent entropy indicators, and including at least one skip n-gram vector indicating occurrences of groups of non-adjacent entropy indicators.
Dependent claims: 2, 3, 4, 5, 6
7. A method comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a sequence of entropy indicators for each of the multiple files, each entropy indicator of the sequence of entropy indicators for the particular file corresponding to a chunk of the particular file;
generating n-gram vectors for the multiple files, where an n-gram vector for the particular file indicates occurrences of groups of entropy indicators in the sequence of entropy indicators for the particular file; and
generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data, wherein the supervised training data includes a plurality of n-gram vectors for each file, the plurality of n-gram vectors for at least one file including a zero-skip n-gram vector indicating occurrences of groups of adjacent entropy indicators, and including at least one skip n-gram vector indicating occurrences of groups of non-adjacent entropy indicators.
Dependent claims: 8, 9, 10, 11, 12, 13
14. A computer-readable storage device storing instructions that, when executed, cause a computer to perform operations comprising:
accessing information identifying multiple files and identifying classification data for the multiple files, wherein the classification data indicates, for a particular file of the multiple files, whether the particular file includes malware;
generating a sequence of entropy indicators for each of the multiple files, each entropy indicator of the sequence of entropy indicators for the particular file corresponding to a chunk of the particular file;
generating n-gram vectors for the multiple files, wherein an n-gram vector for the particular file indicates occurrences of groups of entropy indicators in the sequence of entropy indicators for the particular file; and
generating and storing a file classifier using the n-gram vectors and the classification data as supervised training data, wherein the supervised training data includes a plurality of n-gram vectors for each file, the plurality of n-gram vectors for at least one file including a zero-skip n-gram vector indicating occurrences of groups of adjacent entropy indicators, and including at least one skip n-gram vector indicating occurrences of groups of non-adjacent entropy indicators.
Dependent claims: 15, 16, 17, 18, 19
Specification