Fast algorithms and metrics for comparing hierarchical clustering information trees and numerical vectors

US 8,095,543 B1
Filed: 07/31/2008
Issued: 01/10/2012
Est. Priority Date: 07/31/2008
Status: Expired due to Fees

Abstract

In various embodiments, a method for determining a similarity between two data sets is disclosed, the steps of which include determining a first list of data clusters for a first hierarchically-organized data set, determining a second list of data clusters for a second hierarchically-organized data set, and determining a similarity between the first and second data sets by calculating a maximum flow between the first list of data clusters and the second list of data clusters.
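The first two steps of the abstract (determining a list of data clusters for each hierarchically-organized data set) can be illustrated with a minimal sketch. It assumes, purely for illustration, that a hierarchy is encoded as nested Python tuples with string leaves and that a "cluster" is the set of leaves under an internal node; the `Tree` encoding and the name `list_clusters` are hypothetical and do not come from the patent text.

```python
from typing import FrozenSet, List, Union

# Hypothetical encoding: a leaf is a string label, an internal node is a tuple of subtrees.
Tree = Union[str, tuple]


def list_clusters(tree: Tree) -> List[FrozenSet[str]]:
    """Return the leaf set under every internal node, parents before children.

    The first entry (when present) is the root, i.e. the cluster holding all elements.
    """
    clusters: List[FrozenSet[str]] = []

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        index = len(clusters)
        clusters.append(frozenset())  # reserve a slot so parents precede children
        members = frozenset().union(*(leaves(child) for child in node))
        clusters[index] = members
        return members

    leaves(tree)
    return clusters


if __name__ == "__main__":
    example = (("a", "b"), ("c", ("d", "e")))
    for cluster in list_clusters(example):
        print(sorted(cluster))
```

For the example tree this yields four clusters: the root containing all five elements, followed by the clusters for `("a", "b")`, `("c", ("d", "e"))`, and `("d", "e")`.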

12 Claims

1. A method for determining a similarity between two data sets, comprising:
   - determining a first list of data clusters for a first hierarchically-organized data set;
   - determining a second list of data clusters for a second hierarchically-organized data set;
   - removing a master cluster from consideration if the first and second data sets have all common elements;
   - determining a similarity between the first and second data sets by calculating a maximum flow between the first list of data clusters and the second list of data clusters;
   - determining a maximum number of redundant elements for the first and second data sets; and
   - dividing the maximum number of redundant elements by the maximum matching flow to arrive at a distance metric.

2. The method of claim 1, wherein the first list of data clusters includes only a subset of the total available clusters for the first data set.

3. The method of claim 2, the second list of data clusters includes only a subset of the total available clusters for the second data set.

4. The method of claim 2, wherein determining a first list of data clusters includes excluding data clusters having a size below a minimum size threshold.

5. The method of claim 2, wherein determining a first list of data clusters includes excluding data clusters having a size above a maximum size threshold.

6. The method of claim 2, wherein the maximum flow is based upon cardinalities between the first and second data sets.
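One possible reading of the size-threshold, master-cluster, and maximum-flow steps recited in claims 1 through 6 is sketched below. Everything specific here is an assumption made for illustration, not the patented construction: clusters are modeled as frozensets, the "master cluster" is taken to be the cluster containing every element, and the flow network is a bipartite graph whose capacities are cluster sizes and pairwise intersection sizes (one way of basing the flow on cardinalities). The function names are hypothetical, and the flow is computed with a plain Edmonds-Karp routine. The final division by the flow is left out because this record does not define how the "maximum number of redundant elements" is obtained.

```python
from collections import defaultdict, deque
from typing import FrozenSet, List, Optional, Tuple

Cluster = FrozenSet[str]


def drop_master(first: List[Cluster], second: List[Cluster]) -> Tuple[List[Cluster], List[Cluster]]:
    """Drop the all-elements cluster from both lists when both sets share all elements."""
    universe_a = frozenset().union(*first)
    universe_b = frozenset().union(*second)
    if universe_a != universe_b:
        return first, second
    return ([c for c in first if c != universe_a],
            [c for c in second if c != universe_b])


def select_clusters(clusters: List[Cluster], min_size: int = 1,
                    max_size: Optional[int] = None) -> List[Cluster]:
    """Keep only clusters whose size lies within the (assumed inclusive) thresholds."""
    return [c for c in clusters
            if len(c) >= min_size and (max_size is None or len(c) <= max_size)]


def max_flow(capacity, source, sink) -> int:
    """Edmonds-Karp: repeatedly augment along shortest residual paths found by BFS."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in capacity[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Collect the edges of the augmenting path, then push the bottleneck along it.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck
            capacity[v][u] += bottleneck  # residual (reverse) edge
        flow += bottleneck


def cluster_flow(first: List[Cluster], second: List[Cluster]) -> int:
    """Maximum flow through an assumed cardinality-weighted bipartite network."""
    capacity = defaultdict(lambda: defaultdict(int))
    for i, a in enumerate(first):
        capacity["source"][("A", i)] = len(a)
        for j, b in enumerate(second):
            shared = len(a & b)
            if shared:
                capacity[("A", i)][("B", j)] = shared
    for j, b in enumerate(second):
        capacity[("B", j)]["sink"] = len(b)
    return max_flow(capacity, "source", "sink")


if __name__ == "__main__":
    first = [frozenset("abcde"), frozenset("ab"), frozenset("cde"), frozenset("de")]
    second = [frozenset("abcde"), frozenset("abc"), frozenset("de")]
    first, second = drop_master(first, second)
    print(cluster_flow(first, second))
```

In this sketch the master cluster is removed before any size filtering, so that the shared element universe is computed from the complete cluster lists. Under these assumptions the example prints a flow of 5, while comparing a cluster list with itself gives a flow equal to the sum of its cluster sizes.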

7. An electronic medium capable of being read by a computing device, the electronic medium containing instructions for:
   - determining a first list of data clusters for a first hierarchically-organized data set;
   - determining a second list of data clusters for a second hierarchically-organized data set;
   - removing a master cluster from consideration if the first and second data sets have all common elements;
   - determining a similarity between the first and second data sets by calculating a maximum flow between the first list of data clusters and the second list of data clusters;
   - determining a maximum number of redundant elements for the first and second data sets; and
   - dividing the maximum number of redundant elements by the maximum matching flow to arrive at a distance metric.

8. The electronic medium of claim 7, wherein the first list of data clusters includes only a subset of the total available clusters for the first data set.

9. The electronic medium of claim 8, the second list of data clusters includes only a subset of the total available clusters for the second data set.

10. The electronic medium of claim 8, wherein determining a first list of data clusters includes excluding data clusters having a size below a minimum size threshold.

11. The electronic medium of claim 8, wherein determining a first list of data clusters includes excluding data clusters having a size above a maximum size threshold.

12. The electronic medium of claim 8, wherein the maximum flow is based upon cardinalities between the first and second data sets.

Current Assignee: The United States of America as represented by the Secretary of the Navy
Original Assignee: The United States of America as represented by the Secretary of the Navy
Inventors: Gupta, Anjum
Primary Examiner(s): AL HASHEMI, SANA A
Application Number: US12/183,926
Time in Patent Office: 1,258 Days
Field of Search: 707/749, 707/750, 707/754, 707/758, 707/778
US Class Current: 707/749
CPC Class Codes:
- G06F 16/2458   Special types of queries, e...
- G16B 30/00   ICT specially adapted for s...
- G16B 40/00   ICT specially adapted for b...
- G16B 40/30   Unsupervised data analysis
