Apparatus and methods for scalable object clustering

US 8,566,317 B1
Filed: 01/06/2010
Issued: 10/22/2013
Est. Priority Date: 01/06/2010
Status: Active Grant

Abstract

One embodiment relates to an apparatus configured to efficiently group a set of strings into clusters of related strings. Data storage is configured to store computer-readable code and data, and a processor is configured to access the data storage and to execute said computer-readable code. Computer-readable code is configured to receive the set of strings, determine an evaluation function between pairs of strings in said set, and group the strings into clusters, wherein determining the evaluation function between pairs of strings utilizes hash tables. Another embodiment relates to a computer-implemented method of efficiently grouping a set of strings into clusters of related strings based on rules of inference. Other embodiments and features are also disclosed.

19 Claims

1. An apparatus configured to efficiently group a set of strings into clusters of related strings, the apparatus comprising:
- data storage configured to store computer-readable code and data;
- a processor configured to access the data storage and to execute said computer-readable code;
- computer-readable code configured to receive the set of strings;
- computer-readable code configured to determine a binary output of an evaluation function between a pair of strings by steps including (i) generating a hash table based on a first string, (ii) matching sub-strings of a second string against the first string using the hash table, (iii) recording matches in a list, and (iv) applying a threshold based at least in part on a length of common substrings between the first and second strings; and
- computer-readable code configured to group the strings in the set into clusters by a procedure which, for each string that does not already belong to a cluster, determines the binary output of the evaluation function between the string and each other string in the set that do not yet belong to any cluster.
  - 2. The apparatus of claim 1, wherein given that the set of strings is defined as {S_j | j = 1 to m}, the computer-readable code is configured, for each string S_j, to determine if S_j belongs to a cluster already, and if so, then to skip to the next string, and if not, then to perform a loop of instructions relating to the string S_j.
  - 3. The apparatus of claim 2, wherein the loop of instructions comprises determining if a pair of strings are sufficiently related using a hash table generated based on the string S_j, and if so, assigning one or both strings to a cluster.
  - 4. The apparatus of claim 3, wherein the cluster is newly created if the string S_j does not yet belong to any cluster.
  - 5. The apparatus of claim 1, wherein the set of strings comprise binary strings derived from malware payloads.
  - 6. The apparatus of claim 1, wherein the set of strings comprise ASCII strings derived from malicious scripts.
  - 7. The apparatus of claim 1, wherein the set of strings comprise fingerprints derived from spam messages.
  - 8. The apparatus of claim 1, wherein the set of strings comprise document fingerprints derived from sensitive documents.
  - 9. The apparatus of claim 1, wherein if the binary output of the evaluation function between the string and another string indicates a match, then said another string is assigned to a same cluster as the string.
  - 10. The apparatus of claim 9, wherein the same cluster is a new cluster if the string does not belong to any cluster.
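
The evaluation function recited in claim 1, steps (i) through (iv), can be illustrated with a minimal sketch. This is not the patented implementation. The function name `evaluate`, the fixed sub-string length `k`, the greedy extension of each match, and the rule "sum of matched lengths >= `min_common`" are all assumptions chosen for illustration; the claim only requires a threshold based at least in part on a length of common substrings.

```python
from typing import Dict, List, Tuple


def evaluate(first: str, second: str, k: int = 4, min_common: int = 8) -> bool:
    """Hypothetical binary evaluation function between two strings.

    k and min_common are illustrative parameters, not values from the patent.
    """
    # (i) Build a hash table from the first string:
    #     each length-k sub-string maps to the offsets where it occurs.
    table: Dict[str, List[int]] = {}
    for i in range(len(first) - k + 1):
        table.setdefault(first[i:i + k], []).append(i)

    # (ii) Match sub-strings of the second string against the table and
    # (iii) record each match as (offset_in_first, offset_in_second, length).
    matches: List[Tuple[int, int, int]] = []
    j = 0
    while j <= len(second) - k:
        best_len, best_i = 0, -1
        for i in table.get(second[j:j + k], ()):
            length = k
            # Extend the k-character seed match as far as both strings agree.
            while (i + length < len(first)
                   and j + length < len(second)
                   and first[i + length] == second[j + length]):
                length += 1
            if length > best_len:
                best_len, best_i = length, i
        if best_len:
            matches.append((best_i, j, best_len))
            j += best_len  # continue after the matched region
        else:
            j += 1

    # (iv) Apply a threshold on the total length of common sub-strings
    #      (assumed rule; other length-based thresholds are possible).
    common = sum(length for _, _, length in matches)
    return common >= min_common
```

Because the hash table is built once per first string, it can be reused while that string is compared against many second strings, which is consistent with claim 3's reference to a hash table generated based on the string S_j.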

11. A computer-implemented method of efficiently grouping a set of strings into clusters of related strings, the method comprising:
- receiving the set of strings;
- determining a binary output of an evaluation function between a pair of strings by steps including (i) generating a hash table based on a first string, (ii) matching sub-strings of a second string against the first string using the hash table, (iii) recording matches in a list, and (iv) applying a threshold based at least in part on a length of common substrings between the first and second strings; and
- grouping the strings into clusters by a procedure which, for each string in said set that does not already belong to a cluster, determines the binary output of the evaluation function between the string and other strings in the set that do not yet belong to any cluster.
  - 12. The method of claim 11, wherein given that the set of strings is defined as {S_j | j = 1 to m}, the method determines, for each string S_j, if S_j belongs to a cluster already, and if so, then skips to the next string, and if not, then performs a loop of instructions relating to the string S_j.
  - 13. The method of claim 12, wherein the loop of instructions comprises determining if a pair of strings are sufficiently related using a hash table generated based on the string S_j, and if so, assigning one or both strings to a cluster.
  - 14. The method of claim 13, wherein the cluster is newly created if the string S_j does not yet belong to any cluster.
  - 15. The method of claim 11, wherein the set of strings comprise binary strings derived from malware payloads.
  - 16. The method of claim 11, wherein the set of strings comprise ASCII strings derived from malicious scripts.
  - 17. The method of claim 11, wherein the set of strings comprise fingerprints derived from spam messages.
  - 18. The method of claim 11, wherein the set of strings comprise document fingerprints derived from sensitive documents.
  - 19. The method of claim 11, wherein if the binary output of the evaluation function between the string and another string indicates a match, then said another string is assigned to a same cluster as the string, wherein the same cluster is a new cluster if the string does not belong to any cluster.
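
The grouping procedure of claims 11, 12, and 19 can be sketched as a single pass over the set. This is an illustrative reading of the claim language, not the patented implementation. The function name `cluster`, the `related` callback (standing in for the binary evaluation function), and the choice to open a new cluster for every unclustered string, even one that ends up alone, are assumptions.

```python
from typing import Callable, List, Optional, Sequence


def cluster(strings: Sequence[str],
            related: Callable[[str, str], bool]) -> List[List[int]]:
    """Group string indices into clusters using a binary predicate.

    `related` is a hypothetical stand-in for the evaluation function.
    Returns a list of clusters, each a list of indices into `strings`.
    """
    cluster_of: List[Optional[int]] = [None] * len(strings)
    clusters: List[List[int]] = []

    for j, s_j in enumerate(strings):
        if cluster_of[j] is not None:
            continue  # S_j already belongs to a cluster: skip to the next string

        # S_j is unclustered, so a new cluster is created for it (assumption:
        # the cluster is kept even if no other string joins it).
        cid = len(clusters)
        clusters.append([j])
        cluster_of[j] = cid

        # Compare S_j only with strings that do not yet belong to any cluster.
        # Every string before index j is already clustered, so start at j + 1.
        for t in range(j + 1, len(strings)):
            if cluster_of[t] is None and related(s_j, strings[t]):
                cluster_of[t] = cid
                clusters[cid].append(t)

    return clusters
```

For example, with a toy predicate that treats strings sharing a first character as related, `cluster(["apple", "avocado", "banana", "apricot", "blueberry", "cherry"], lambda a, b: a[0] == b[0])` returns `[[0, 1, 3], [2, 4], [5]]`. Skipping strings that are already clustered is what keeps the number of pairwise evaluations below the full m(m-1)/2 when clusters are large.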

Current Assignee: Trend Micro Inc.
Original Assignee: Trend Micro Inc.
Inventors: Ren, Liwei; Yan, Wei
Primary Examiner(s): PYO, MONICA M
Application Number: US12/683,350
Time in Patent Office: 1,385 Days
Field of Search: 707/737, 707/738, 707/999.003-999.004, 726/23, 726/26
US Class Current: 707/737
CPC Class Codes: G06F 16/355 Class or cluster creation o...
