Systems and methods for clustering data

US 9,684,705 B1
Filed: 03/14/2014
Issued: 06/20/2017
Est. Priority Date: 03/14/2014
Status: Expired due to Fees

First Claim

Patent Images

1. A computer-implemented method for clustering data, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:

identifying a plurality of samples that have been grouped into existing clusters, the existing clusters comprising a predetermined cluster radius;

locating a sample, from within the plurality of samples, that is a centroid of an existing cluster;

restructuring the existing cluster to generate a new cluster that includes samples with matching attributes by;

locating another sample that is, among the plurality of samples, next closest to the centroid relative to a most recently located sample;

determining whether an attribute of the next-closest sample matches an attribute of the centroid;

determining whether to adjust the radius of the existing cluster based on whether the attribute of the next-closest sample matches the attribute of the centroid;

repeating the steps of locating the next-closest sample, determining whether the attributes match, and determining whether to adjust the radius of the existing cluster until the attribute of the next-closest sample does not match the attribute of the centroid, and when the attribute of the next-closest sample does not match the attribute of the centroid, adjusting the radius of the existing cluster by setting the radius as a distance from the centroid to a most-recently located matching sample such that only samples with matching attributes are included within the new cluster; and

restructuring the existing clusters that have not been restructured by generating, using samples within the plurality of samples that are not included in the new cluster, additional new clusters until each sample within the plurality of samples is included within at least one new cluster, wherein for each additional new cluster;

the additional new cluster comprises a variable cluster radius instead of the predetermined cluster radius;

the additional new cluster includes at least one sample whose attribute matches an attribute of a centroid of the additional new cluster; and

the additional new clusters are generated by a computing system such that a single pass over the plurality of samples is needed to assign each sample to a new cluster without requiring additional computing resources that would be needed from the computing system for multiple iterations of a clustering algorithm.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A computer-implemented method for clustering data may include (1) identifying a plurality of samples, (2) locating a sample, from within the plurality of samples, that is a centroid of a cluster, (3) locating another sample that is, among the plurality of samples, next closest to the centroid relative to a most-recently located sample, (4) determining whether an attribute of the next-closest sample matches an attribute of the centroid, (5) determining whether to adjust a radius of the cluster based on whether the attribute of the next-closest sample matches the attribute of the centroid, and (6) repeating the steps of locating the next-closest sample, determining whether the attributes match, and determining whether to adjust the radius of the cluster, until the attribute of the next-closest sample does not match the attribute of the centroid. Various other methods, systems, and computer-readable media are also disclosed.

29 Citations

View as Search Results

16 Claims

1. A computer-implemented method for clustering data, at least a portion of the method being performed by a computing device comprising at least one processor, the method comprising:
- identifying a plurality of samples that have been grouped into existing clusters, the existing clusters comprising a predetermined cluster radius;
  
  locating a sample, from within the plurality of samples, that is a centroid of an existing cluster;
  
  restructuring the existing cluster to generate a new cluster that includes samples with matching attributes by;
  
  locating another sample that is, among the plurality of samples, next closest to the centroid relative to a most recently located sample;
  
  determining whether an attribute of the next-closest sample matches an attribute of the centroid;
  
  determining whether to adjust the radius of the existing cluster based on whether the attribute of the next-closest sample matches the attribute of the centroid;
  
  repeating the steps of locating the next-closest sample, determining whether the attributes match, and determining whether to adjust the radius of the existing cluster until the attribute of the next-closest sample does not match the attribute of the centroid, and when the attribute of the next-closest sample does not match the attribute of the centroid, adjusting the radius of the existing cluster by setting the radius as a distance from the centroid to a most-recently located matching sample such that only samples with matching attributes are included within the new cluster; and
  
  restructuring the existing clusters that have not been restructured by generating, using samples within the plurality of samples that are not included in the new cluster, additional new clusters until each sample within the plurality of samples is included within at least one new cluster, wherein for each additional new cluster;
  
  the additional new cluster comprises a variable cluster radius instead of the predetermined cluster radius;
  
  the additional new cluster includes at least one sample whose attribute matches an attribute of a centroid of the additional new cluster; and
  
  the additional new clusters are generated by a computing system such that a single pass over the plurality of samples is needed to assign each sample to a new cluster without requiring additional computing resources that would be needed from the computing system for multiple iterations of a clustering algorithm.
- View Dependent Claims (2, 3, 4, 5, 6, 15, 16)
- - 2. The method of claim 1, wherein the attribute of the centroid comprises at least one of:
    - a label with which the sample has been tagged that indicates the attribute of the sample; and
      
      a malware classification.
  - 3. The method of claim 1, wherein determining whether the attribute of the next-closest sample matches the attribute of the centroid comprises:
    - identifying the attribute of the next-closest sample and the attribute of the centroid; and
      
      comparing the attribute of the next-closest sample to the attribute of the centroid.
  - 4. The method of claim 1, wherein:
    - determining whether the attribute of the next-closest sample matches the attribute of the centroid comprises determining that the attribute of the next-closest sample matches the attribute of the centroid; and
      
      determining whether to adjust the radius of the existing cluster comprises determining to not adjust the radius.
  - 5. The method of claim 1, wherein:
    - when the attribute of the next-closest sample does not match the attribute of the centroid, the most-recently located matching sample is located within the radius; and
      
      adjusting the radius comprises decreasing the radius.
  - 6. The method of claim 1, wherein:
    - when the attribute of the next-closest sample does not match the attribute of the centroid, the most-recently located matching sample is located outside the radius; and
      
      adjusting the radius comprises increasing the radius.
  - 15. The method of claim 1, wherein generating the additional new clusters further comprises, for each additional new cluster, selecting a sample from within the plurality of samples that is not already included within a new cluster to be the centroid of the additional new cluster.
  - 16. The method of claim 1, wherein generating the additional new clusters until each sample within the plurality of samples is included within at least one new cluster is performed such that only a single pass over the plurality of samples is needed to assign each sample to a new cluster.

7. A system for clustering data samples, the system comprising:
- an identification module that identifies a plurality of samples that have been grouped into existing clusters, the existing clusters comprising a predetermined cluster radius;
  
  a location module that;
  
  locates a sample, from within the plurality of samples, that is a centroid of an existing cluster;
  
  locates another sample that is, among the plurality of samples, next closest to the centroid relative to a most recently located sample;
  
  a determination module that;
  
  determines whether an attribute of the next-closest sample matches an attribute of the centroid;
  
  determines whether to adjust the radius of the existing cluster based on whether the attribute of the next-closest sample matches the attribute of the centroid;
  
  a repetition module that;
  
  restructures the existing cluster to generate a new cluster that includes samples with matching attributes by repeating the steps of locating the next-closest sample, determining whether the attributes match, and determining whether to adjust the radius of the existing cluster until the attribute of the next-closest sample does not match the attribute of the centroid, and when the attribute of the next-closest sample does not match the attribute of the centroid, adjusts the radius of the existing cluster by setting the radius as a distance from the centroid to a most-recently located matching sample such that only samples with matching attributes are included within the new cluster;
  
  restructures the existing clusters that have not been restructured by generating, using samples within the plurality of samples that are not included in the new cluster, additional new clusters until each sample within the plurality of samples is included within at least one new cluster, wherein for each additional new cluster;
  
  the additional new cluster comprises a variable cluster radius instead of the predetermined cluster radius;
  
  the additional new cluster includes at least one sample whose attribute matches an attribute of a centroid of the additional new cluster; and
  
  the additional new clusters are generated by a computing system such that a single pass over the plurality of samples is needed to assign each sample to a new cluster without requiring additional computing resources that would be needed from the computing system for multiple iterations of a clustering algorithm; and
  
  at least one processor that executes the identification module, the determination module and the repetition module.
- View Dependent Claims (8, 9, 10, 11, 12)
- - 8. The system of claim 7, wherein the attribute of the centroid comprises at least one of:
    - a label with which the sample has been tagged that indicates the attribute of the sample; and
      
      a malware classification.
  - 9. The system of claim 7, wherein the determination module determines whether the attribute of the next-closest sample matches the attribute of the centroid by:
    - identifying the attribute of the next-closest sample and the attribute of the centroid; and
      
      comparing the attribute of the next-closest sample to the attribute of the centroid.
  - 10. The system of claim 7, wherein the determination module:
    - determines whether the attribute of the next-closest sample matches the attribute of the centroid by determining that the attribute of the next-closest sample matches the attribute of the centroid; and
      
      determines whether to adjust the radius of the existing cluster by determining to not adjust the radius.
  - 11. The system of claim 7, wherein:
    - when the attribute of the next-closest sample does not match the attribute of the centroid, the most-recently located matching sample is located within the radius; and
      
      the determination module adjusts the radius by decreasing the radius.
  - 12. The system of claim 7, wherein:
    - when the attribute of the next-closest sample does not match the attribute of the centroid, the most-recently located matching sample is located outside the radius; and
      
      the determination module adjusts the radius by increasing the radius.

13. A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
- identify a plurality of samples that have been grouped into existing clusters, the existing clusters comprising a predetermined cluster radius;
  
  locate a sample, from within the plurality of samples, that is a centroid of an existing cluster;
  
  restructure the existing cluster to generate a new cluster that includes samples with matching attributes by;
  
  locating another sample that is, among the plurality of samples, next closest to the centroid relative to a most recently located sample;
  
  determining whether an attribute of the next-closest sample matches an attribute of the centroid;
  
  determining whether to adjust the radius of the existing cluster based on whether the attribute of the next-closest sample matches the attribute of the centroid;
  
  repeating the steps of locating the next-closest sample, determining whether the attributes match, and determining whether to adjust the radius of the existing cluster until the attribute of the next-closest sample does not match the attribute of the centroid, and when the attribute of the next-closest sample does not match the attribute of the centroid, adjusting the radius of the existing cluster by setting the radius as a distance from the centroid to a most-recently located matching sample such that only samples with matching attributes are included within the new cluster; and
  
  restructure the existing clusters that have not been restructured by generating, using samples within the plurality of samples that are not included in the new cluster, additional new clusters until each sample within the plurality of samples is included within at least one new cluster, wherein for each additional new cluster;
  
  the additional new cluster comprises a variable cluster radius instead of the predetermined cluster radius;
  
  the additional new cluster includes at least one sample whose attribute matches an attribute of a centroid of the additional new cluster; and
  
  the additional new clusters are generated by a computing system such that a single pass over the plurality of samples is needed to assign each sample to a new cluster without requiring additional computing resources that would be needed from the computing system for multiple iterations of a clustering algorithm.
- View Dependent Claims (14)
- - 14. The non-transitory computer-readable medium of claim 13, wherein the computer-executable instructions further cause the computing device to:
    - determine whether the attribute of the next-closest sample matches the attribute of the centroid by determining that the attribute of the next-closest sample matches the attribute of the centroid; and
      
      determine whether to adjust the radius of the existing cluster by determining to not adjust the radius.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee
Symantec Corporation (NortonLifeLock Inc.)
Inventors
Satish, Sourabh, Salinas, Govind
Primary Examiner(s)
Vital, Pierre
Assistant Examiner(s)
ANTOINE, ALICIA M

Application Number

US14/214,581
Time in Patent Office

1,194 Days
Field of Search

707737
US Class Current
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 16/35   Clustering; Classification

G06F 16/358   Browsing; Visualisation the...

G06F 21/56   Computer malware detection ...

G06F 21/564   by virus signature recognition

Systems and methods for clustering data

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

29 Citations

16 Claims

Specification

Solutions

Use Cases

Quick Links

Systems and methods for clustering data

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

29 Citations

16 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links