Automated detection of malware using trained neural network-based file classifiers and machine learning
First Claim
1. A computing device comprising:
a memory configured to store instructions; and
a processor configured to execute the instructions from the memory to perform operations comprising:
generating zero-skip n-gram data for a first subset of files of multiple files included in an application file package, first zero-skip n-gram data of the zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing a first file of the first subset of files;
generating skip n-gram data for the first subset of files, first skip n-gram data of the skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the first file;
generating n-gram data for the first subset of files, first n-gram data of the n-gram data indicating occurrences of groups of entropy indicators in a first set of entropy indicators derived from first file entropy data for the first file, each entropy indicator of the first set of entropy indicators having a value representing entropy of a corresponding chunk of the first file;
generating a first feature vector based on the zero-skip n-gram data, the skip n-gram data, and the n-gram data;
generating a second feature vector based on occurrences of attributes in a second subset of files of the multiple files;
sending the first feature vector and the second feature vector to a second computing device as inputs to a file classifier; and
receiving, from the second computing device, classification data associated with the application file package based on the first feature vector and the second feature vector, the classification data indicating whether the application file package includes malware.
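The three kinds of n-gram data recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation. The bigram size (n = 2), the single skip distance, the 256-byte chunk size, the eight entropy bins, the letter encoding of the indicators, and the CRC32 hashing of counts into a fixed-length vector are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import string
import zlib
from collections import Counter

# Printable ASCII byte values, excluding control whitespace (assumption).
PRINTABLE = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")


def printable_chars(data: bytes) -> str:
    """Keep only the printable characters of the raw file bytes."""
    return "".join(chr(b) for b in data if b in PRINTABLE)


def skip_bigrams(text: str, skip: int) -> Counter:
    """Count character pairs separated by `skip` characters.

    skip=0 gives zero-skip bigrams (adjacent characters);
    skip>=1 gives skip bigrams (non-adjacent characters).
    """
    step = skip + 1
    return Counter(text[i] + text[i + step] for i in range(len(text) - step))


def chunk_entropy(chunk: bytes) -> float:
    """Shannon entropy of a chunk in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum(c / total * math.log2(c / total) for c in Counter(chunk).values())


def entropy_indicators(data: bytes, chunk_size: int = 256, bins: int = 8) -> str:
    """Map each chunk's entropy to a one-letter indicator, 'a' (low) to 'h' (high)."""
    letters = []
    for start in range(0, len(data), chunk_size):
        h = chunk_entropy(data[start:start + chunk_size])
        letters.append(chr(ord("a") + min(int(h / 8.0 * bins), bins - 1)))
    return "".join(letters)


def hashed_vector(tagged_counters, size: int = 64) -> list:
    """Fold tagged n-gram counts into a fixed-length vector (hashing is an assumption)."""
    vec = [0] * size
    for tag, counter in tagged_counters:
        for gram, count in counter.items():
            vec[zlib.crc32(f"{tag}:{gram}".encode("utf-8")) % size] += count
    return vec


def first_feature_vector(data: bytes, size: int = 64) -> list:
    """Combine zero-skip, skip, and entropy-indicator bigram counts for one file."""
    text = printable_chars(data)
    indicators = entropy_indicators(data)
    return hashed_vector(
        [
            ("zero", skip_bigrams(text, 0)),
            ("skip1", skip_bigrams(text, 1)),
            ("entropy", skip_bigrams(indicators, 0)),
        ],
        size,
    )
```

Calling `first_feature_vector(file_bytes)` on each file of the first subset yields one fixed-length vector per file. How per-file vectors would be merged into a single package-level vector is not specified in the claim and is left out of the sketch.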
Abstract
Automated malware detection for application file packages using machine learning (e.g., trained neural network-based classifiers) is described. A particular method includes generating, at a first device, a first feature vector based on occurrences of character n-grams corresponding to a first subset of files of multiple files of an application file package. The method includes generating, at the first device, a second feature vector based on occurrences of attributes in a second subset of files of the multiple files. The method includes sending the first feature vector and the second feature vector from the first device to a second device as inputs to a file classifier. The method includes receiving, at the first device from the second device, classification data associated with the application file package based on the first feature vector and the second feature vector. The classification data indicates whether the application file package includes malware.
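The second feature vector and the exchange between the two devices can be sketched in the same hedged way. The abstract does not say which attributes are counted or how the vectors are transported, so the attribute vocabulary, the JSON message layout, and the `malware` response field below are hypothetical placeholders, and the network transfer itself is omitted.

```python
import json

# Hypothetical attribute vocabulary; the source does not enumerate attributes.
ATTRIBUTES = ["SEND_SMS", "READ_CONTACTS", "INTERNET", "RECEIVE_BOOT_COMPLETED"]


def second_feature_vector(file_texts, attributes=ATTRIBUTES) -> list:
    """Count occurrences of each attribute across the second subset of files."""
    return [sum(text.count(attr) for text in file_texts) for attr in attributes]


def build_request(first_vector, second_vector) -> str:
    """Serialize both feature vectors for the second device (assumed JSON layout)."""
    return json.dumps(
        {"first_feature_vector": first_vector, "second_feature_vector": second_vector}
    )


def package_includes_malware(response_body: str) -> bool:
    """Read the classification data returned by the second device (assumed field name)."""
    return bool(json.loads(response_body)["malware"])
```

In this arrangement only the feature vectors leave the first device, not the file contents, and the trained file classifier runs on the second device.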
131 Citations
25 Claims
1. A computing device comprising:
a memory configured to store instructions; and
a processor configured to execute the instructions from the memory to perform operations comprising:
generating zero-skip n-gram data for a first subset of files of multiple files included in an application file package, first zero-skip n-gram data of the zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing a first file of the first subset of files;
generating skip n-gram data for the first subset of files, first skip n-gram data of the skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the first file;
generating n-gram data for the first subset of files, first n-gram data of the n-gram data indicating occurrences of groups of entropy indicators in a first set of entropy indicators derived from first file entropy data for the first file, each entropy indicator of the first set of entropy indicators having a value representing entropy of a corresponding chunk of the first file;
generating a first feature vector based on the zero-skip n-gram data, the skip n-gram data, and the n-gram data;
generating a second feature vector based on occurrences of attributes in a second subset of files of the multiple files;
sending the first feature vector and the second feature vector to a second computing device as inputs to a file classifier; and
receiving, from the second computing device, classification data associated with the application file package based on the first feature vector and the second feature vector, the classification data indicating whether the application file package includes malware.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method comprising:
generating, at a first device, zero-skip n-gram data for a first subset of files of multiple files included in an application file package, first zero-skip n-gram data of the zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing a first file of the first subset of files;
generating, at the first device, skip n-gram data for the first subset of files, first skip n-gram data of the skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the first file;
generating, at the first device, n-gram data for the first subset of files, first n-gram data of the n-gram data indicating occurrences of groups of entropy indicators in a first set of entropy indicators derived from first file entropy data for the first file, each entropy indicator of the first set of entropy indicators having a value representing entropy of a corresponding chunk of the first file;
generating, at the first device, a first feature vector based on the zero-skip n-gram data, the skip n-gram data, and the n-gram data;
generating, at the first device, a second feature vector based on occurrences of attributes in a second subset of files of the multiple files;
sending the first feature vector and the second feature vector from the first device to a second device as inputs to a file classifier; and
receiving, at the first device from the second device, classification data associated with the application file package based on the first feature vector and the second feature vector, the classification data indicating whether the application file package includes malware.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21
22. A computer-readable storage device storing instructions that, when executed, cause a computer to perform operations comprising:
generating zero-skip n-gram data for a first subset of files of multiple files included in an application file package, first zero-skip n-gram data of the zero-skip n-gram data indicating occurrences of adjacent characters in printable characters representing a first file of the first subset of files;
generating skip n-gram data for the first subset of files, first skip n-gram data of the skip n-gram data indicating occurrences of non-adjacent characters in the printable characters representing the first file;
generating n-gram data for the first subset of files, first n-gram data of the n-gram data indicating occurrences of groups of entropy indicators in a first set of entropy indicators derived from first file entropy data for the first file, each entropy indicator of the first set of entropy indicators having a value representing entropy of a corresponding chunk of the first file;
generating a first feature vector based on the zero-skip n-gram data, the skip n-gram data, and the n-gram data;
generating a second feature vector based on occurrences of attributes in a second subset of files of the multiple files;
sending the first feature vector and the second feature vector to a computing device as inputs to a file classifier; and
receiving, from the computing device, classification data associated with the application file package based on the first feature vector and the second feature vector, the classification data indicating whether the application file package includes malware.
Dependent claims: 23, 24, 25
Specification