Determining malware infection risk

US 10,164,995 B1
Filed: 08/14/2015
Issued: 12/25/2018
Est. Priority Date: 08/14/2014
Status: Active Grant


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for performing semi-supervised learning on partially labeled nodes on a bipartite graph. One described method can determine a useful score of malware infection risk from partial known facts for entities modeled as nodes on a bipartite graph, where network traffic is measured between inside-the-enterprise entities and outside-the-enterprise entities. This and other methods can be implemented in a large-scale massively parallel processing database. Methods of scaling the partial label input and of presenting the results are also described.
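As a rough illustration of the graph the abstract describes, the sketch below aggregates per-connection log records into weighted edges between inside-the-enterprise and outside-the-enterprise entities. The record layout `(inside, outside, bytes)`, the use of byte counts as the traffic measure, and the name `aggregate_traffic` are assumptions made for this example; the source does not specify them.

```python
from collections import defaultdict


def aggregate_traffic(log_records):
    """Sum traffic per (inside, outside) pair to obtain bipartite edge weights.

    `log_records` is an iterable of (inside_entity, outside_entity, bytes)
    tuples. This layout is hypothetical, chosen only for the example.
    """
    weights = defaultdict(float)
    for inside, outside, nbytes in log_records:
        weights[(inside, outside)] += nbytes
    return dict(weights)


# Made-up log records for illustration only.
logs = [
    ("10.0.0.5", "example.org", 1200),
    ("10.0.0.5", "example.org", 800),
    ("10.0.0.7", "example.org", 300),
    ("10.0.0.7", "bad.example", 50),
]
edges = aggregate_traffic(logs)
```

Each key of `edges` is one edge of the bipartite graph, and its value is the edge weight.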

23 Citations

28 Claims

1. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
  
  receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
  
  computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
  
  iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
  
  identifying, based on the final scores, one or more first entities, one or more second entities, or both, that are likely infected with malware; and
  
  reporting the identified one or more first entities or one or more second entities to a user.
- - 2. The system of claim 1, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph.
  - 3. The system of claim 1, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity as a propagation score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph plus the initial score for the entity.
  - 4. The system of claim 3, the operations further comprising:
    - for each entity, computing the propagation score as a weighted sum of the previously-determined scores, each weight in the weighted sum being a corresponding edge weight from the bipartite graph.
  - 5. The system of claim 1, the operations further comprising:
    - obtaining network traffic data from transaction logs; and
      
      aggregating the network traffic data to generate the aggregate network traffic data.
  - 6. The system of claim 1, wherein:
    - the first entities are entities within a perimeter of perimeter entities of the data communication network; and
      
      the second entities are entities outside the perimeter of perimeter entities of the data communication network.
  - 7. The system of claim 1, wherein reporting the identified one or more first entities or one or more second entities comprises:
    - displaying multiple final scores in a sorted order, including displaying each of the multiple final scores with an identifier of the first entity or the second entity of the final score.
  - 8. The system of claim 7, the operations further comprising:
    - displaying each of the multiple final scores with an indication of whether the first entity or the second entity of the final score had a known ground truth label value.
  - 9. The system of claim 1, wherein iteratively computing a respective final score for each entity comprises:
    - iteratively computing new score values $x_{t+1}$ and $y_{t+1}$ according to:
  - 10. The system of claim 9, wherein:
    - alpha and beta have different values, the higher value indicating a greater degree of trust in the initial scores of the respective first entities and the second entities.
  - 11. The system of claim 9, the operations further comprising:
    - precomputing
  - 12. The system of claim 11, the operations further comprising:
    - distributing U(i,j) and x_t(i) values identically partitioned by the i value to multiple worker nodes;
      
      distributing V(j,i) and y_t(j) values identically partitioned by the j value to multiple worker nodes;
      
      calculating $\left(\sum_{j \in N(i)} V(j,i) \times y_t(j)\right)$ in parallel on the worker nodes having the required U(i,j) and x_t(i) values; and
      
      calculating $\left(\sum_{i \in N(j)} U(i,j) \times x_t(i)\right)$ in parallel on the worker nodes having the required V(j,i) and y_t(j) values.
  - 13. The system of claim 9, the operations further comprising:
    - precomputing
  - 14. The system of claim 13, the operations further comprising:
    - distributing U(i) and x_t(i) values identically partitioned by the i value to multiple worker nodes;
      
      distributing V(j) and y_t(j) values identically partitioned by the j value to multiple worker nodes;
      
      calculating $\left(\sum_{j \in N(i)} V(j) \times y_t(j)\right)$ in parallel on the worker nodes having the required U(i) and x_t(i) values; and
      
      calculating $\left(\sum_{i \in N(j)} U(i) \times x_t(i)\right)$ in parallel on the worker nodes having the required V(j) and y_t(j) values.
  - 15. The system of claim 1, wherein iteratively computing a respective final score for each entity comprises:
    - precomputing U(i)=1/|N(i)| and V(j)=1/|N(j)|;
      
      distributing U(i) and V(j) values on multiple worker nodes, wherein the multiple worker nodes are partitioned by i values for U(i) and by j values for V(j);
      
      iteratively computing new score values $x_{t+1}$ and $y_{t+1}$ according to:
      
      $x_{t+1}(i) = \alpha \times \left(\sum_{j \in N(i)} V(j) \times y_t(j)\right) + x_0(i)$ and
      
      $y_{t+1}(j) = \beta \times \left(\sum_{i \in N(j)} U(i) \times x_t(i)\right) + y_0(j)$;
      
      wherein:
      
      x_t(i) represents a score for a first entity i at a (t+1)th iterative step;
      
      y_t(i) represents a score for a second entity j at a (t+1)th iterative step;
      
      alpha and beta are propagation constants;
      
      N(p) represents a set of all nodes in the bipartite graph that share an edge with vertex p;
      
      W(s,t) represents an aggregate weight of the edge in the bipartite graph between nodes s and t;
      
      x₀(i) represents an initial score for the first entity i; and
      
      y₀(j) represents an initial score for the second entity j.
  - 16. The system of claim 15, wherein:
    - alpha and beta have different values, the higher value indicating a greater degree of trust in the initial scores of the respective first entities and the second entities.
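The update rule recited in claim 15 can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the massively parallel database implementation the abstract mentions. The function name `propagate`, the default values of `alpha` and `beta`, the fixed iteration count, and the choice to start the iteration from the initial scores are all assumptions made for the example.

```python
def propagate(edges, x0, y0, alpha=0.5, beta=0.5, iterations=20):
    """Iterate x_{t+1}(i) = alpha * sum_{j in N(i)} V(j) * y_t(j) + x0(i)
    and     y_{t+1}(j) = beta  * sum_{i in N(j)} U(i) * x_t(i) + y0(j),
    with U(i) = 1/|N(i)| and V(j) = 1/|N(j)|.

    `edges` is an iterable of (first_entity, second_entity) pairs.
    Parameter defaults and the stopping rule are illustrative assumptions.
    """
    neighbors_of_i, neighbors_of_j = {}, {}
    for i, j in edges:
        neighbors_of_i.setdefault(i, set()).add(j)
        neighbors_of_j.setdefault(j, set()).add(i)

    # Precompute the normalization factors once, before iterating.
    U = {i: 1.0 / len(ns) for i, ns in neighbors_of_i.items()}
    V = {j: 1.0 / len(ns) for j, ns in neighbors_of_j.items()}

    # Assumption: the iteration starts from the initial scores.
    x = {i: x0.get(i, 0.0) for i in neighbors_of_i}
    y = {j: y0.get(j, 0.0) for j in neighbors_of_j}

    for _ in range(iterations):
        # Both updates read only the previous step's values (x_t, y_t).
        new_x = {
            i: alpha * sum(V[j] * y[j] for j in ns) + x0.get(i, 0.0)
            for i, ns in neighbors_of_i.items()
        }
        new_y = {
            j: beta * sum(U[i] * x[i] for i in ns) + y0.get(j, 0.0)
            for j, ns in neighbors_of_j.items()
        }
        x, y = new_x, new_y
    return x, y


# Toy graph: two hosts, two domains, one domain carrying a positive label.
toy_edges = [("h1", "d1"), ("h2", "d1"), ("h2", "d2")]
x_scores, y_scores = propagate(toy_edges, x0={}, y0={"d2": 1.0})
```

In this toy graph, host `h2` talks directly to the labeled domain `d2`, so it ends with a larger score than `h1`, which is connected to `d2` only through the shared domain `d1`.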

17. A computer-implemented method performed using one or more hardware processors, the method comprising:
- receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
  
  receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
  
  computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
  
  iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
  
  identifying, based on the final scores, one or more first entities or second entities that are likely infected with malware; and
  
  reporting, using the one or more hardware processors, the identified one or more first entities or second entities to a user.
- - 18. The method of claim 17, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph.
  - 19. The method of claim 17, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity as a propagation score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph plus the initial score for the entity.
  - 20. The method of claim 19, comprising:
    - for each entity, computing the propagation score as a weighted sum of the previously-determined scores, each weight in the weighted sum being a corresponding edge weight from the bipartite graph.
  - 21. The method of claim 17, wherein:
    - the first entities are entities within a perimeter of perimeter entities of the data communication network; and
      
      the second entities are entities outside the perimeter of perimeter entities of the data communication network.
  - 22. The method of claim 21, comprising:
    - displaying each of the final scores with an indication of whether the first entity or the second entity of the final score had a known ground truth label value.
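The initial-score step recited in claim 17 (labels of +r scaled to +r/A, labels of −r scaled to −r/B, unlabeled entities set to zero) could look roughly like the following. The function name `initial_scores` and the dictionary-based inputs are assumptions for this sketch; the claim does not prescribe a data layout, nor does it say which sign denotes "unsafe".

```python
def initial_scores(labels, entities):
    """Scale ground truth labels into initial scores.

    `labels` maps an entity to +r or -r (r > 0); `entities` lists every
    entity in the graph. Both input shapes are hypothetical.
    """
    count_pos = sum(1 for v in labels.values() if v > 0)  # A in the claim
    count_neg = sum(1 for v in labels.values() if v < 0)  # B in the claim

    scores = {}
    for entity in entities:
        value = labels.get(entity, 0.0)
        if value > 0:
            scores[entity] = value / count_pos   # +r / A
        elif value < 0:
            scores[entity] = value / count_neg   # -r / B
        else:
            scores[entity] = 0.0                 # no ground truth label
    return scores


# Two positive labels, one negative label, one unlabeled entity, with r = 1.
example_labels = {"a": 1.0, "b": 1.0, "c": -1.0}
example_scores = initial_scores(example_labels, ["a", "b", "c", "d"])
```

One consequence of this scaling is that the positive initial scores sum to +r and the negative ones sum to −r, regardless of how many labels of each sign were supplied.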

23. One or more non-transitory computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
- receiving data representing aggregate network traffic data as a bipartite graph between nodes representing first entities and nodes representing second entities, wherein each edge of the bipartite graph connects a node representing a respective first entity with a node representing a respective second entity, each edge of the bipartite graph has an edge weight representing a measure of the aggregate network traffic between the entities represented by nodes of the graph connected by the edge, the aggregate network traffic data represents network traffic between the first entities and the second entities, and each of the entities is an entity communicating with one or more other entities on a data communication network;
  
  receiving an initial collection of ground truth label values for some of the first entities, some of the second entities, or both, wherein each ground truth label value for an entity indicates that the entity is known to be safe or unsafe, and wherein each ground truth label value is either −r or +r, wherein r is a positive real number;
  
  computing a respective initial score for each of the first entities and each of the second entities, each initial score being a non-zero value for a respective entity that has a known ground truth label value indicating that the entity is known to be safe or unsafe or a zero value for entities that do not have a known ground truth label value, each entity with a ground truth value of −r being assigned an initial score of −r/B, and each entity with a ground truth value of +r being assigned an initial score of +r/A, wherein B is a count of how many values of −r were present in the initial collection, and wherein A is a count of how many values of +r were present in the initial collection;
  
  iteratively computing a respective final score for each of the first entities and the second entities from the initial scores and the edge weights, the final score indicating malware infection risk of a corresponding entity;
  
  identifying, based on the final scores, one or more first entities or second entities that are likely infected with malware; and
  
  reporting the identified one or more first entities or second entities to a user.
- - 24. The one or more storage media of claim 23, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph.
  - 25. The one or more storage media of claim 23, wherein iteratively computing a respective final score for each entity comprises:
    - for each entity, computing at each iteration a new score for the entity as a propagation score for the entity from previously-determined scores for other entities having a connection to the entity in the bipartite graph plus the initial score for the entity.
  - 26. The one or more storage media of claim 25, the operations further comprising:
    - for each entity, computing the propagation score as a weighted sum of the previously-determined scores, each weight in the weighted sum being a corresponding edge weight from the bipartite graph.
  - 27. The one or more storage media of claim 23, wherein:
    - the first entities are entities within a perimeter of perimeter entities of the data communication network; and
      
      the second entities are entities outside the perimeter of perimeter entities of the data communication network.
  - 28. The one or more storage media of claim 27, the operations further comprising:
    - displaying each of the final scores with an indication of whether the first entity or the second entity of the final score had a known ground truth label value.
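The reporting step (final scores shown in sorted order, each with its entity identifier and an indication of whether the entity had a ground truth label) might be sketched as below. The names `build_report` and `format_report`, the descending sort order, the "labeled"/"inferred" wording, and the sample scores are all assumptions for illustration.

```python
def build_report(final_scores, labeled_entities):
    """Return (entity, score, has_ground_truth) rows, highest score first."""
    rows = [
        (entity, score, entity in labeled_entities)
        for entity, score in final_scores.items()
    ]
    # Sort by score descending; ties are broken by entity name.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def format_report(rows):
    """Render the rows as fixed-width text lines."""
    lines = []
    for entity, score, has_label in rows:
        marker = "labeled" if has_label else "inferred"
        lines.append(f"{entity:<16} {score:+.4f}  {marker}")
    return "\n".join(lines)


# Made-up final scores for illustration only.
sample_scores = {"10.0.0.5": 0.03, "10.0.0.7": 0.59, "bad.example": 1.12}
report_rows = build_report(sample_scores, labeled_entities={"bad.example"})
print(format_report(report_rows))
```

Keeping the ground truth flag in each row lets a reader tell scores that were seeded from known labels apart from scores that arose purely through propagation.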

Current Assignee: Pivotal Software, Inc. (Broadcom, Inc.)
Original Assignee: Pivotal Software, Inc. (Broadcom, Inc.)
Inventors: Fang, Chunsheng; Lin, Derek Chin-Teh
Primary Examiner(s): Lemma, Samson B
Application Number: US14/826,867
Time in Patent Office: 1,229 Days
Field of Search: 726 25
US Class Current
CPC Class Codes

G06F 16/9024   Graphs; Linked lists G06F16...

G06F 21/55   Detecting local intrusion o...

G06F 21/56   Computer malware detection ...

G06F 21/577   Assessing vulnerabilities a...

H04L 63/1433   Vulnerability analysis
