Identification of mislabeled samples via phantom nodes in label propagation
First Claim
1. A computer-implemented method for protecting computing devices from mislabeled malware, the method comprising:
creating a graph from a plurality of sample executable files by executing the sample executable files in an isolated execution environment, the graph including sample file nodes associated with the sample executable files and behavior nodes associated with behavior signatures, wherein edges in the graph connect a behavior node with a set of one or more sample file nodes, wherein the one or more sample executable files associated with the one or more sample file nodes exhibit the behavior signature associated with the behavior node;
receiving data indicating a label distribution of a neighbor node of a sample file node in the graph;
in response to determining that a current label for the sample file node is unknown, setting the current label distribution for the sample file node to a consensus of label distributions of neighboring nodes;
in response to determining that the current label for the sample file node is known, performing operations including:
creating a phantom node associated with the sample file node,
determining a neighborhood opinion for the phantom node, based at least in part on the label distribution of the neighboring nodes,
determining a difference between the neighborhood opinion and the current label for the sample file node, and
determining whether the current label is incorrect based, at least in part, on the difference; and
in response to determining that the current label for the sample file node is incorrect, performing at least one remedial action on the sample executable file associated with the sample file node having the incorrect current label.
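The graph-building and propagation steps recited above can be illustrated with a minimal Python sketch. It is one plausible reading of the claim, not the patented implementation. The names `build_graph`, `consensus` and `propagate_once` are hypothetical, and so are the input format (a dict from sample id to the behavior signatures observed in the sandbox) and the choice of a normalized average as the "consensus".

```python
from collections import defaultdict


def build_graph(reports):
    """Build a bipartite adjacency map from sandbox reports.

    reports: dict mapping a sample id to an iterable of behavior signature ids
    (hypothetical format; the claim does not specify one).
    """
    neighbors = defaultdict(set)
    for sample, signatures in reports.items():
        for sig in signatures:
            # An edge links a behavior node to every sample that exhibits it.
            neighbors[("file", sample)].add(("behavior", sig))
            neighbors[("behavior", sig)].add(("file", sample))
    return dict(neighbors)


def consensus(distributions):
    """Normalized average of label distributions (assumed consensus rule)."""
    total = defaultdict(float)
    for dist in distributions:
        for label, weight in dist.items():
            total[label] += weight
    norm = sum(total.values())
    return {label: w / norm for label, w in total.items()} if norm else {}


def propagate_once(neighbors, dist, known):
    """One synchronous propagation pass.

    dist:  node -> label distribution (dict label -> weight)
    known: set of sample file nodes whose label is known
    Returns (new_dist, phantom), where phantom maps each known node to the
    neighborhood opinion collected by its phantom node.
    """
    new_dist, phantom = {}, {}
    for node, nbrs in neighbors.items():
        opinion = consensus([dist[n] for n in nbrs if dist.get(n)])
        if node in known:
            new_dist[node] = dist[node]  # the known label stays fixed
            phantom[node] = opinion      # the phantom node records what neighbors say
        else:
            # Unknown nodes adopt the consensus; keep the old value if there is none yet.
            new_dist[node] = opinion or dist.get(node, {})
    return new_dist, phantom
```

Calling `propagate_once` repeatedly until the distributions stop changing gives a simple label propagation loop. Behavior nodes carry no known label in this sketch, so they are treated like unknown nodes and take on the consensus of the samples attached to them.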
Abstract
Systems and methods identify potentially mislabeled file samples. A graph is created from a plurality of sample files. The graph includes nodes associated with the sample files and behavior nodes associated with behavior signatures. Phantom nodes are created in the graph for those sample files having a known label. During a label propagation operation, a node receives data indicating a label distribution of a neighbor node in the graph. In response to determining that the current label for the node is known, a neighborhood opinion is determined for the associated phantom node, based at least in part on the label distribution of the neighboring nodes. After the label propagation operation has completed, differences between the neighborhood opinion and the current label distribution for nodes are determined. If the difference exceeds a threshold, then the current label may be incorrect.
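The final comparison step described in the abstract can be sketched as follows. The abstract does not say how the difference is measured or what threshold is used. Total variation distance and the default of 0.5 are assumptions chosen for illustration, and the function names `label_disagreement` and `flag_suspects` are hypothetical.

```python
def label_disagreement(current, opinion):
    """Total variation distance between two label distributions.

    This metric is an assumption; the source only speaks of a "difference".
    """
    labels = set(current) | set(opinion)
    return 0.5 * sum(abs(current.get(l, 0.0) - opinion.get(l, 0.0)) for l in labels)


def flag_suspects(current_labels, phantom_opinions, threshold=0.5):
    """Return the nodes whose known label differs from the neighborhood
    opinion by more than the threshold (hypothetical default of 0.5)."""
    return sorted(
        node
        for node, current in current_labels.items()
        if node in phantom_opinions
        and label_disagreement(current, phantom_opinions[node]) > threshold
    )
```

For example, a sample labeled entirely "clean" whose phantom node reports a neighborhood opinion of 0.75 "malware" and 0.25 "clean" has a disagreement of 0.75 under this metric, so it would be flagged as potentially mislabeled at the assumed threshold.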
19 Claims
1. A computer-implemented method for protecting computing devices from mislabeled malware, the method comprising:
creating a graph from a plurality of sample executable files by executing the sample executable files in an isolated execution environment, the graph including sample file nodes associated with the sample executable files and behavior nodes associated with behavior signatures, wherein edges in the graph connect a behavior node with a set of one or more sample file nodes, wherein the one or more sample executable files associated with the one or more sample file nodes exhibit the behavior signature associated with the behavior node;
receiving data indicating a label distribution of a neighbor node of a sample file node in the graph;
in response to determining that a current label for the sample file node is unknown, setting the current label distribution for the sample file node to a consensus of label distributions of neighboring nodes;
in response to determining that the current label for the sample file node is known, performing operations including:
creating a phantom node associated with the sample file node,
determining a neighborhood opinion for the phantom node, based at least in part on the label distribution of the neighboring nodes,
determining a difference between the neighborhood opinion and the current label for the sample file node, and
determining whether the current label is incorrect based, at least in part, on the difference; and
in response to determining that the current label for the sample file node is incorrect, performing at least one remedial action on the sample executable file associated with the sample file node having the incorrect current label.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
at least one processor; and
a non-transitory computer readable storage medium having a program stored thereon, the program causing the at least one processor to execute the steps of:
(a) creating a graph from a plurality of sample executable files by executing the sample executable files in an isolated execution environment, the graph including sample file nodes associated with the sample executable files and behavior nodes associated with behavior signatures, wherein edges in the graph connect a behavior node with a set of one or more sample file nodes, wherein the one or more sample executable files associated with the one or more sample file nodes exhibit the behavior signature associated with the behavior node;
(b) receiving data indicating a label distribution of a neighbor node of a sample file node in the graph;
(c) in response to determining that a current label for the sample file node is unknown, setting the current label distribution for the sample file node to a consensus of label distributions of neighboring nodes;
(d) in response to determining that the current label for the sample file node is known, performing operations including:
(i) creating a phantom node associated with the sample file node,
(ii) determining a neighborhood opinion for the phantom node, based at least in part on the label distribution of the neighboring nodes,
(iii) determining a difference between the neighborhood opinion and the current label for the sample file node, and
(iv) determining whether the current label is incorrect based, at least in part, on the difference; and
(e) in response to determining that the current label for the sample file node is incorrect, performing at least one remedial action on the sample executable file associated with the sample file node having the incorrect current label.
Dependent claims: 9, 10, 11, 12, 13
14. A non-transitory computer readable storage medium comprising a set of instructions executable by a computer, the non-transitory computer readable storage medium comprising:
instructions for creating a graph from a plurality of sample executable files by executing the sample executable files in an isolated execution environment, the graph including sample file nodes associated with the sample executable files and behavior nodes associated with behavior signatures, wherein edges in the graph connect a behavior node with a set of one or more sample file nodes, wherein the one or more sample executable files associated with the one or more sample file nodes exhibit the behavior signature associated with the behavior node;
instructions for receiving data indicating a label distribution of a neighbor node of a sample file node in the graph;
instructions for, in response to determining that a current label for the sample file node is unknown, setting the current label distribution for the sample file node to a consensus of label distributions of neighboring nodes;
instructions for, in response to determining that the current label for the sample file node is known, performing operations including:
creating a phantom node associated with the sample file node,
determining a neighborhood opinion for the phantom node, based at least in part on the label distribution of the neighboring nodes,
determining a difference between the neighborhood opinion and the current label for the sample file node, and
determining whether the current label is incorrect based, at least in part, on the difference; and
instructions for, in response to determining that the current label for the sample file node is incorrect, performing at least one remedial action on the sample executable file associated with the sample file node having the incorrect current label.
Dependent claims: 15, 16, 17, 18, 19
Specification