Inferring file and website reputations by belief propagation leveraging machine reputation

US 8,341,745 B1
Filed: 02/22/2010
Issued: 12/25/2012
Est. Priority Date: 02/22/2010
Status: Active Grant

2 Assignments

0 Petitions

Abstract

The probability of a computer file being malware is inferred by iteratively propagating domain knowledge among computer files, related clients, and/or related source domains. A graph is generated to include machine nodes representing clients, file nodes representing files residing on the clients, and optionally domain nodes representing source domains hosting the files. The graph also includes edges connecting the machine nodes with the related file nodes, and optionally edges connecting the domain nodes with the related file nodes. Priors and edge potentials are set for the nodes and the edges based on related domain knowledge. The domain knowledge is iteratively propagated and aggregated among the connected nodes through exchanging messages among the connected nodes. The iteration process ends when a stopping criterion is met. The classification and associated marginal probability for each file node are calculated based on the priors, the received messages, and the edge potentials associated with the edges through which the messages were received.
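
The message exchange summarized in the abstract can be sketched as a standard sum-product (loopy) belief propagation loop over a machine-file graph. This is a minimal illustration, not the patented implementation: the function name `propagate`, the two-state (good/bad) encoding, the `eps` homophily strength, the uniform initial messages, and the convergence tolerance are all assumptions made for the example.

```python
from collections import defaultdict


def _normalize(pair):
    """Scale a (good, bad) pair so that it sums to 1."""
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def propagate(edges, priors, eps=0.1, max_iters=50, tol=1e-9):
    """Return {node: P(node is good)} after iterative message passing.

    edges  : iterable of (node_a, node_b) pairs, e.g. (machine, file)
    priors : {node: prior P(good)}, strictly between 0 and 1; default 0.5
    eps    : assumed homophily strength of the edge potential
    """
    # Homophilic edge potential psi[state_of_sender][state_of_receiver];
    # index 0 is "good", index 1 is "bad".
    psi = ((0.5 + eps, 0.5 - eps), (0.5 - eps, 0.5 + eps))

    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    phi = {}
    for node in neighbors:
        p = priors.get(node, 0.5)
        phi[node] = (p, 1.0 - p)

    # msgs[(i, j)] is the message from node i to node j, initially uniform.
    msgs = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    for _ in range(max_iters):
        new_msgs = {}
        for i, j in msgs:
            # Combine the sender's prior with messages from all neighbors but j.
            good, bad = phi[i]
            for k in neighbors[i]:
                if k != j:
                    good *= msgs[(k, i)][0]
                    bad *= msgs[(k, i)][1]
            new_msgs[(i, j)] = _normalize((
                good * psi[0][0] + bad * psi[1][0],
                good * psi[0][1] + bad * psi[1][1],
            ))
        change = max((abs(new_msgs[e][0] - msgs[e][0]) for e in msgs), default=0.0)
        msgs = new_msgs
        if change < tol:  # stopping criterion: messages have converged
            break

    beliefs = {}
    for i in neighbors:
        good, bad = phi[i]
        for k in neighbors[i]:
            good *= msgs[(k, i)][0]
            bad *= msgs[(k, i)][1]
        beliefs[i] = _normalize((good, bad))[0]
    return beliefs


# Hypothetical data: each machine holds one labeled and one unlabeled file.
edges = [
    ("machine:1", "file:known_good"), ("machine:1", "file:unknown_1"),
    ("machine:2", "file:known_bad"), ("machine:2", "file:unknown_2"),
]
priors = {"file:known_good": 0.99, "file:known_bad": 0.01}
beliefs = propagate(edges, priors)
```

In this toy graph the unlabeled file that shares a machine with a known-good file ends up with a belief above 0.5, and the one that shares a machine with a known-bad file ends up below 0.5. Nodes with very many neighbors would likely need log-space arithmetic to avoid underflow, which the sketch omits.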

180 Citations

20 Claims

1. A computer-implemented method for detecting malicious computer files, comprising:
- generating a graph comprising nodes representing a plurality of clients and computer files residing thereon, wherein distinct clients and distinct computer files are represented by distinct nodes in the graph, wherein a node representing a client is connected to nodes representing computer files residing on that client through edges;
  
  determining priors for nodes in the graph and edge potentials for edges in the graph based on domain knowledge, wherein a prior for a node representing a client comprises an assessment of a likelihood of the client getting infected by malware based on the domain knowledge, a prior for a node representing a computer file comprises an assessment of a likelihood of the computer file being malware based on the domain knowledge, and an edge potential reflects a relationship between nodes connected by an associated edge based on the domain knowledge;
  
  iteratively propagating a probability of a computer file being legitimate among the nodes by transmitting messages along the edges in the graph, wherein a message transmitted by a node is generated based on a prior of the node and messages received by the node during any previous iterations; and
  
  determining whether a computer file is classified as malicious based on a probability associated with a corresponding node in the graph.
- Dependent claims (2 to 16):
  - 2. The computer-implemented method of claim 1, wherein a node representing a computer file is connected to a plurality of nodes representing a plurality of clients on which the computer file resides.
  - 3. The method of claim 1, wherein determining priors for nodes in the graph based on domain knowledge comprises:
    - identifying a node representing a computer file known to be legitimate; and
      
      assigning a high value to a prior of the node representing the computer file known to be legitimate, wherein the high value indicates a low likelihood of the computer file being malware.
  - 4. The method of claim 1, wherein determining priors for nodes in the graph based on domain knowledge comprises:
    - identifying a node representing a computer file known to be malicious; and
      
      assigning a low value to a prior of the node representing the computer file known to be malicious, wherein the low value indicates a high likelihood of the computer file being malware.
  - 5. The method of claim 1, wherein determining priors for nodes in the graph based on domain knowledge comprises:
    - identifying a node representing a computer file appearing on a large number of clients; and
      
      assigning a high value to a prior of the node representing the computer file appearing on a large number of clients, wherein the high value indicates a low likelihood of the computer file being malware.
  - 6. The method of claim 1, wherein determining priors for nodes in the graph based on domain knowledge comprises:
    - identifying a node representing a computer file appearing on few clients; and
      
      assigning a low value to a prior of the node representing the computer file appearing on few clients, wherein the low value indicates a high likelihood of the computer file being malware.
  - 7. The method of claim 1, wherein determining edge potentials for edges in the graph based on domain knowledge comprises:
    - assigning a value to an edge potential for an edge between a node representing a client and a node representing a computer file that captures a homophilic machine-file relationship, wherein the homophilic machine-file relationship describes that legitimate computer files are more likely to appear on clients with good reputations and malicious computer files are more likely to appear on clients with low reputations.
  - 8. The method of claim 1, wherein the message transmitted by the node is generated based on the prior of the node, messages received during previous iterations by the node, and edge potentials associated with edges the received messages were transmitted along.
  - 9. The method of claim 1, wherein the probability of the computer file being legitimate is determined based on a prior of a node representing the computer file, messages received by the node, and edge potentials associated with edges the received messages were transmitted along.
  - 10. The method of claim 1, wherein the iteratively propagating step terminates iterating responsive to the probability for the node converging within a predetermined threshold value.
  - 11. The method of claim 1, wherein the iteratively propagating step terminates iterating responsive to a predetermined number of iterations have been completed.
  - 12. The method of claim 1, wherein the iteratively propagating step terminates iterating responsive to a true positive rate of malware being correctly classified malicious based on probabilities associated with corresponding nodes in the graph.
  - 13. The method of claim 1, wherein determining whether a computer file is classified as malicious based on a probability associated with a corresponding node in the graph comprises:
    - comparing a probability associated with a node in the graph with a threshold value; and
      
      determining that a computer file represented by the node is malware responsive to the determination.
  - 14. The method of claim 1, wherein the graph further comprises nodes representing a plurality of source domains hosting the computer files, wherein distinct source domains are represented by distinct nodes in the graph, wherein a node representing a source domain is connected to nodes representing computer files hosted by that source domain through edges, wherein a prior for a node representing a source domain comprises an assessment of a likelihood for the source domain hosting malware based on the domain knowledge.
  - 15. The method of claim 14, further comprising:
    - iteratively propagating a probability of a source domain being unlikely to host malware among the nodes by transmitting messages along the edges in the graph, wherein a message transmitted by a node is generated based on a prior of the node and messages received by the node during any previous iterations; and
      
      determining whether a source domain is likely to host malware based on a probability associated with a corresponding node in the graph.
  - 16. The method of claim 1, wherein the graph further comprises nodes representing one or more of the following:
    - software publishers, signers of digital signatures, and file names.
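
The prior assignment in claims 3 to 6, the homophilic edge potential in claim 7, and the threshold comparison in claim 13 could look roughly like the sketch below. Every name and number here (`file_prior`, `homophilic_potential`, `is_malicious`, the 0.99/0.01/0.7/0.3 values, the client-count cutoffs, `eps`, and the direction of the threshold test) is an assumption chosen for illustration; the claims state only that values are "high" or "low".

```python
# All numeric values are illustrative assumptions, not taken from the patent.
KNOWN_GOOD_PRIOR = 0.99  # known legitimate file: high prior (claim 3)
KNOWN_BAD_PRIOR = 0.01   # known malicious file: low prior (claim 4)


def file_prior(known_label=None, client_count=0, many_clients=1000, few_clients=5):
    """Return a prior P(file is legitimate); a high value means unlikely malware."""
    if known_label == "legitimate":
        return KNOWN_GOOD_PRIOR
    if known_label == "malicious":
        return KNOWN_BAD_PRIOR
    if client_count >= many_clients:
        return 0.7  # appears on a large number of clients (claim 5)
    if client_count <= few_clients:
        return 0.3  # appears on few clients (claim 6)
    return 0.5      # no domain knowledge applies: stay neutral


def homophilic_potential(eps=0.1):
    """Edge potential keyed by (machine_state, file_state), as in claim 7.

    Matching states (good-good, bad-bad) get more weight than mismatched ones.
    """
    return {
        ("good", "good"): 0.5 + eps, ("good", "bad"): 0.5 - eps,
        ("bad", "good"): 0.5 - eps, ("bad", "bad"): 0.5 + eps,
    }


def is_malicious(prob_legitimate, threshold=0.5):
    """Classify a file by comparing its final probability with a threshold."""
    return prob_legitimate < threshold
```

For example, `file_prior(client_count=5000)` returns the assumed high-prevalence prior of 0.7, and `is_malicious(0.2)` returns `True` under the assumed 0.5 threshold.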

17. A computer system for detecting malicious computer files, comprising:
- a non-transitory computer-readable storage medium storing executable computer program code, the computer program code comprising program code for:
  
  generating a graph comprising nodes representing a plurality of clients and computer files residing thereon, wherein distinct clients and distinct computer files are represented by distinct nodes in the graph, wherein a node representing a client is connected to nodes representing computer files residing on that client through edges;
  
  determining priors for nodes in the graph and edge potentials for edges in the graph based on domain knowledge, wherein a prior for a node representing a client comprises an assessment of a likelihood of the client getting infected by malware based on the domain knowledge, a prior for a node representing a computer file comprises an assessment of a likelihood of the computer file being malware based on the domain knowledge, and an edge potential reflects a relationship between nodes connected by an associated edge based on the domain knowledge;
  
  iteratively propagating a probability of a computer file being legitimate among the nodes by transmitting messages along the edges in the graph, wherein a message transmitted by a node is generated based on a prior of the node and messages received by the node during any previous iterations; and
  
  determining whether a computer file is classified as malicious based on a probability associated with a corresponding node in the graph.
- Dependent claims (18):
  - 18. The computer system of claim 17, wherein the graph further comprises nodes representing a plurality of source domains hosting the computer files, wherein distinct source domains are represented by distinct nodes in the graph, wherein a node representing a source domain is connected to nodes representing computer files hosted by that source domain through edges, wherein a prior for a node representing a source domain comprises an assessment of a likelihood for the source domain hosting malware based on the domain knowledge.

19. A non-transitory computer-readable storage medium encoded with executable computer program code for detecting malicious computer files, the computer program code comprising program code for:
- generating a graph comprising nodes representing a plurality of clients and computer files residing thereon, wherein distinct clients and distinct computer files are represented by distinct nodes in the graph, wherein a node representing a client is connected to nodes representing computer files residing on that client through edges;
  
  determining priors for nodes in the graph and edge potentials for edges in the graph based on domain knowledge, wherein a prior for a node representing a client comprises an assessment of a likelihood of the client getting infected by malware based on the domain knowledge, a prior for a node representing a computer file comprises an assessment of a likelihood of the computer file being malware based on the domain knowledge, and an edge potential reflects a relationship between nodes connected by an associated edge based on the domain knowledge;
  
  iteratively propagating a probability of a computer file being legitimate among the nodes by transmitting messages along the edges in the graph, wherein a message transmitted by a node is generated based on a prior of the node and messages received by the node during any previous iterations; and
  
  determining whether a computer file is classified as malicious based on a probability associated with a corresponding node in the graph.
- Dependent claims (20):
  - 20. The non-transitory computer-readable storage medium of claim 19, wherein the graph further comprises nodes representing a plurality of source domains hosting the computer files, wherein distinct source domains are represented by distinct nodes in the graph, wherein a node representing a source domain is connected to nodes representing computer files hosted by that source domain through edges, wherein a prior for a node representing a source domain comprises an assessment of a likelihood for the source domain hosting malware based on the domain knowledge.
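
The graph-generation step shared by the independent claims, with the optional source-domain nodes of claims 14, 18, and 20, might be sketched as follows. The function name `build_graph`, the `(client_id, file_hash, source_domain)` report format, and the tagged-tuple node encoding are hypothetical choices for this example; the claims require only that distinct clients, files, and domains map to distinct nodes.

```python
def build_graph(reports):
    """Build node and edge sets from (client_id, file_hash, source_domain) reports.

    source_domain may be None when the hosting domain is unknown.
    Nodes are tagged tuples so a client and a file can never collide,
    and sets ensure each distinct entity yields exactly one node.
    """
    nodes, edges = set(), set()
    for client_id, file_hash, source_domain in reports:
        machine = ("machine", client_id)
        file_node = ("file", file_hash)
        nodes.update((machine, file_node))
        edges.add((machine, file_node))        # file resides on this client
        if source_domain is not None:
            domain = ("domain", source_domain)
            nodes.add(domain)
            edges.add((domain, file_node))     # file is hosted by this domain
    return nodes, edges


# Hypothetical reports: two clients share one file downloaded from one domain.
reports = [
    ("client-a", "hash-1", "example.com"),
    ("client-b", "hash-1", "example.com"),
    ("client-b", "hash-2", None),
]
nodes, edges = build_graph(reports)
```

Here the shared file `hash-1` becomes a single file node connected to both machine nodes and to the one domain node, which is the structure over which messages would then be propagated.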

Current Assignee
CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee
Symantec Corporation (NortonLifeLock Inc.)
Inventors
Wright, Adam; Chau, Duen Horng
Primary Examiner(s)
Flynn, Nathan
Assistant Examiner(s)
Doan, Trang T

Application Number
US12/710,324
Time in Patent Office
1,037 Days
Field of Search
None
US Class Current
726/24
CPC Class Codes
G06F 21/56 Computer malware detection ...
