Determining malware infection risk
First Claim
1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising;
receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein
each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity,
each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge,
the aggregate network traffic data represents network traffic between the first entities and the second entities, and
each of the entities is an entity communicating with one or more other entities on a data communication network;
receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
identifying, based on the final scores, one or more first entities, one or more second entities, or both, that are likely infected with malware; and
reporting the identified one or more first entities or one or more second entities to a user.
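The initial-score step recited above can be illustrated with a short sketch. This is not the patentee's implementation; the function name `initial_scores`, the dictionary-based inputs, and the entity identifiers are hypothetical and chosen only for illustration. The claim does not state which sign denotes "unsafe", so the sketch does not assume one.

```python
def initial_scores(entities, labels, r=1.0):
    """Scale ground truth labels into initial scores.

    entities: iterable of entity ids (first and second entities together).
    labels:   dict mapping some entity ids to +r or -r (hypothetical input shape).
    Returns a dict id -> initial score: +r/A, -r/B, or 0.0 for unlabeled entities.
    """
    for value in labels.values():
        if value not in (r, -r):
            raise ValueError("each ground truth label must be +r or -r")

    a = sum(1 for v in labels.values() if v > 0)  # A: count of +r labels
    b = sum(1 for v in labels.values() if v < 0)  # B: count of -r labels

    scores = {}
    for entity in entities:
        value = labels.get(entity)
        if value is None:
            scores[entity] = 0.0      # no ground truth: zero initial score
        elif value > 0:
            scores[entity] = r / a    # +r labels share a total mass of +r
        else:
            scores[entity] = -r / b   # -r labels share a total mass of -r
    return scores


entities = ["h1", "h2", "h3", "d1", "d2"]
labels = {"h1": 1.0, "d1": -1.0, "d2": -1.0}
scores = initial_scores(entities, labels, r=1.0)
```

One consequence of dividing by A and B is that the positive initial scores sum to +r and the negative ones sum to −r, regardless of how unevenly the two label classes are represented in the initial collection.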
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing semi-supervised learning on partially labeled nodes on a bipartite graph. One described method can determine a useful score of malware infection risk from partial known facts for entities modeled as nodes on a bipartite graph, where network traffic is measured between inside-the-enterprise entities and outside-the-enterprise entities. This and other methods can be implemented in a large-scale massively parallel processing database. Methods of scaling the partial label input and of presenting the results are also described.
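As a rough illustration of semi-supervised scoring on a partially labeled bipartite graph, the sketch below spreads initial scores along weighted edges. The abstract and claims shown here do not specify the iteration rule, so the update used (a damped label-propagation step over a degree-normalized adjacency) is a swapped-in, commonly used technique and an assumption, as are the names `propagate`, `alpha`, and `iterations` and the sample entity ids. For the example only, negative scores are treated as "unsafe".

```python
import math


def propagate(edges, init, alpha=0.5, iterations=50):
    """Damped label propagation on a weighted bipartite graph (illustrative only).

    edges: dict mapping (first_entity, second_entity) -> positive edge weight.
    init:  dict mapping entity id -> initial score (missing ids count as 0.0).
    Assumes first-entity ids and second-entity ids never collide.
    """
    # Weighted degree of every node.
    degree = {}
    for (u, v), w in edges.items():
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w

    # Symmetric normalization: w / sqrt(deg(u) * deg(v)).
    norm = {(u, v): w / math.sqrt(degree[u] * degree[v])
            for (u, v), w in edges.items()}

    scores = {n: init.get(n, 0.0) for n in degree}
    for _ in range(iterations):
        incoming = dict.fromkeys(scores, 0.0)
        for (u, v), w in norm.items():
            incoming[u] += w * scores[v]  # second entity influences first
            incoming[v] += w * scores[u]  # first entity influences second
        # Blend the propagated score with the fixed initial score.
        scores = {n: (1 - alpha) * init.get(n, 0.0) + alpha * incoming[n]
                  for n in scores}
    return scores


# Hypothetical traffic: inside hosts h1..h3, outside domains d_bad and d_good.
edges = {("h1", "d_bad"): 10.0, ("h2", "d_bad"): 5.0,
         ("h2", "d_good"): 1.0, ("h3", "d_good"): 8.0}
init = {"d_bad": -1.0, "d_good": 1.0}
final = propagate(edges, init)
flagged = sorted(n for n, s in final.items() if s < 0)
```

With `alpha` below 1 the update is a contraction, so the scores settle toward a fixed point; entities whose traffic goes mostly to negatively labeled neighbors end up with negative scores and can be reported against a chosen threshold.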
23 Citations
28 Claims
1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising;
receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
identifying, based on the final scores, one or more first entities, one or more second entities, or both, that are likely infected with malware; and
reporting the identified one or more first entities or one or more second entities to a user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
17. A computer-implemented method performed using one or more hardware processors, the method comprising:
receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
identifying, based on the final scores, one or more first entities or second entities that are likely infected with malware; and
reporting, using the one or more hardware processors, the identified one or more first entities or second entities to a user.
Dependent claims: 18, 19, 20, 21, 22
23. One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
identifying, based on the final scores, one or more first entities or second entities that are likely infected with malware; and
reporting the identified one or more first entities or second entities to a user.
Dependent claims: 24, 25, 26, 27, 28
Specification