Handling Noise in Training Data for Malware Detection
Abstract
Described systems and methods reduce noise in a corpus used to train automatic classifiers for anti-malware applications. Some embodiments target pairs of records that have opposing labels, e.g., one record labeled as clean/benign and the other labeled as malware. When two such records are found to be similar, they are identified as noise and are either discarded from the corpus or relabeled. Two records may be deemed similar when, in a simple case, they share a majority of features or, in a more sophisticated case, they are sufficiently close in a feature space according to some distance measure.
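The two similarity tests mentioned in the abstract can be illustrated with a minimal Python sketch. The function names, the dictionary and vector record representations, the choice of Euclidean distance, and the threshold value are all assumptions made for illustration; the source does not specify them.

```python
import math

def share_majority(a, b):
    """Simple case: more than half of all features seen in either record
    are present in both records with equal values (assumed interpretation)."""
    keys = set(a) | set(b)
    if not keys:
        return False
    shared = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return shared > len(keys) / 2

def within_distance(u, v, threshold):
    """Feature-space case: records are numeric vectors, and they count as
    similar when their Euclidean distance does not exceed the threshold.
    Euclidean distance is an assumed choice of distance measure."""
    return math.dist(u, v) <= threshold

# Hypothetical records with opposing labels but nearly identical features.
clean_rec = {"packed": True, "imports": 12, "signed": False}
malware_rec = {"packed": True, "imports": 12, "signed": True}
majority_result = share_majority(clean_rec, malware_rec)
distance_result = within_distance((0.0, 0.0), (3.0, 4.0), threshold=5.0)
```

Either predicate could serve as the similarity test applied to a clean/malware pair; which one is appropriate depends on how the records are encoded.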
29 Claims
1. A computer system comprising at least one processor configured to form a set of noise detectors, each noise detector of the set of noise detectors configured to de-noise a corpus of records, wherein the corpus is pre-classified into a subset of clean records and a subset of malware records prior to de-noising, and wherein de-noising the corpus comprises:
- selecting a first record and a second record from the corpus, the first record being labeled as clean and the second record being labeled as malware;
- in response to selecting the first and second records, determining whether the first and second records are similar according to a set of features; and
- in response, when the first and second records are similar, determining that the first and second records are noise.
Dependent claims: 2-14.
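The de-noising steps of claim 1 (select a clean record and a malware record, test similarity, mark both as noise) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the record layout (`id`, `label`, `features`), the Jaccard-overlap similarity test, and its 0.7 threshold are hypothetical. Only the discard option is shown; the abstract also mentions relabeling, which is not sketched here.

```python
from itertools import product

def jaccard_similar(a, b, threshold=0.7):
    """Hypothetical similarity test: Jaccard overlap of two feature sets."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= threshold

def find_noise(corpus, similar=jaccard_similar):
    """Return the ids of records that belong to at least one similar
    clean/malware pair. Every clean record is compared with every
    malware record, so the cost grows with len(clean) * len(malware)."""
    clean = [r for r in corpus if r["label"] == "clean"]
    malware = [r for r in corpus if r["label"] == "malware"]
    noisy_ids = set()
    for first, second in product(clean, malware):
        if similar(first["features"], second["features"]):
            noisy_ids.update((first["id"], second["id"]))
    return noisy_ids

# Hypothetical pre-classified corpus.
corpus = [
    {"id": 1, "label": "clean", "features": {"a", "b", "c"}},
    {"id": 2, "label": "malware", "features": {"a", "b", "c", "d"}},
    {"id": 3, "label": "malware", "features": {"x", "y"}},
]
noisy_ids = find_noise(corpus)
# Discard option: drop every record flagged as noise.
denoised = [r for r in corpus if r["id"] not in noisy_ids]
```

Records 1 and 2 overlap on three of four features, so both are flagged; record 3 shares nothing with the clean record and stays in the corpus.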
15. A method comprising:
- employing at least one processor of a computer system to select a first record and a second record from a corpus, wherein the corpus is pre-classified into a subset of clean records and a subset of malware records prior to selecting the first and second records, and wherein the first record is labeled as clean and the second record is labeled as malware;
- in response to selecting the first and second records, employing the at least one processor to determine whether the first and second records are similar according to a set of features; and
- in response, when the first and second records are similar, employing the at least one processor to determine that the first and second records are noise.
Dependent claims: 16-28.
29. A computer readable medium storing a set of instructions, which, when executed by a computer system, cause the computer system to form a record aggregator and a noise detector connected to the record aggregator, wherein the record aggregator is configured to:
- assign records of a corpus to a plurality of clusters, wherein each record of the corpus is pre-labeled as either clean or malware prior to assigning records to the plurality of clusters, and wherein all members of a cluster of the plurality of clusters share a selected set of record features; and
- in response to assigning the records to the plurality of clusters, send a target cluster of the plurality of clusters to the noise detector for de-noising;
and wherein the noise detector is configured, in response to receiving the target cluster, to:
- select a first record and a second record from the target cluster, the first record being labeled as clean and the second record being labeled as malware;
- in response to selecting the first and second records, determine whether the first and second records are similar according to a set of features; and
- in response, when the first and second records are similar, determine that the first and second records are noise.
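The aggregator-plus-detector arrangement of claim 29 can be sketched roughly as below. The sketch assumes that "sharing a selected set of record features" means having equal values for a chosen list of feature names, and it uses a hypothetical majority-match similarity test. The function names, record layout, and sample data are illustrative only.

```python
from collections import defaultdict

def assign_to_clusters(corpus, selected):
    """Record aggregator: group records that agree on every selected feature."""
    clusters = defaultdict(list)
    for rec in corpus:
        key = tuple(rec["features"].get(name) for name in selected)
        clusters[key].append(rec)
    return list(clusters.values())

def majority_match(x, y):
    """Hypothetical similarity test: over half of all features are equal."""
    keys = set(x) | set(y)
    same = sum(1 for k in keys if k in x and k in y and x[k] == y[k])
    return bool(keys) and same > len(keys) / 2

def detect_noise(cluster, similar):
    """Noise detector: flag similar clean/malware pairs within one cluster."""
    noisy = set()
    for a in cluster:
        if a["label"] != "clean":
            continue
        for b in cluster:
            if b["label"] == "malware" and similar(a["features"], b["features"]):
                noisy.update((a["id"], b["id"]))
    return noisy

corpus = [
    {"id": 1, "label": "clean", "features": {"packer": "upx", "size_kb": 40, "signed": False}},
    {"id": 2, "label": "malware", "features": {"packer": "upx", "size_kb": 40, "signed": False}},
    {"id": 3, "label": "malware", "features": {"packer": "none", "size_kb": 900, "signed": False}},
    {"id": 4, "label": "clean", "features": {"packer": "none", "size_kb": 12, "signed": True}},
]
noisy = set()
for target_cluster in assign_to_clusters(corpus, selected=("packer",)):
    noisy |= detect_noise(target_cluster, majority_match)
```

One plausible benefit of this design is that pairs are compared only within a cluster, which avoids comparing every clean record against every malware record in the whole corpus.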
Specification