System and method for determining internal parameters of a data clustering program

US 20030204484A1
Filed: 03/14/2003
Published: 10/30/2003
Est. Priority Date: 04/26/2002
Status: Active Grant


Abstract

A system and associated method for tuning a data clustering program to a clustering task determine at least one internal parameter of the data clustering program. The determination of one or more of the internal parameters occurs before the clustering begins. Consequently, clustering does not need to be performed iteratively, which reduces the processing time and processing resources required. The system provides pairs of data records, and the user indicates whether or not the records of each pair should belong to the same cluster. The similarity values of the records of the selected pairs are calculated based on the default parameters of the clustering program. From the resulting similarity values, an optimal similarity threshold is determined. When the optimization criterion does not yield a single optimal similarity threshold range, equivalent candidate ranges are selected. To select one of the candidate ranges, pairs of data records having a calculated similarity value within the critical region are offered to the user.
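The threshold determination described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The optimization criterion (fewest disagreements with the user's judgments), the decision rule (`sim >= threshold` means similar), the reading of the critical region as the values lying between two candidate thresholds, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

# (similarity computed with the default parameters, user says "same cluster")
LabeledPair = Tuple[float, bool]


def count_errors(pairs: List[LabeledPair], threshold: float) -> int:
    """Count pairs whose user label disagrees with the threshold decision."""
    # Assumed decision rule: a pair is treated as similar when sim >= threshold.
    return sum((sim >= threshold) != same for sim, same in pairs)


def candidate_ranges(pairs: List[LabeledPair]) -> List[Tuple[float, float]]:
    """Return every threshold range (low, high] with the fewest disagreements.

    All thresholds between two neighboring observed similarity values
    classify the labeled pairs identically, so each range is scored once
    using its upper end.
    """
    values = sorted({sim for sim, _ in pairs})
    bounds = [float("-inf")] + values + [float("inf")]
    scored = [((low, high), count_errors(pairs, high))
              for low, high in zip(bounds, bounds[1:])]
    best = min(errors for _, errors in scored)
    return [rng for rng, errors in scored if errors == best]


def pairs_to_review(similarities: List[float],
                    low_threshold: float,
                    high_threshold: float) -> List[float]:
    """Select similarity values on which the two candidate thresholds disagree."""
    return [s for s in similarities if low_threshold <= s < high_threshold]


# Hypothetical user judgments for six record pairs.
labeled = [(0.9, True), (0.8, True), (0.7, False),
           (0.6, True), (0.3, False), (0.2, False)]

ranges = candidate_ranges(labeled)
print(ranges)  # [(0.3, 0.6), (0.7, 0.8)]

# Two equivalent candidates remain, so further pairs from the region
# between them would be offered to the user.
first, second = ranges[0][1], ranges[-1][1]
print(pairs_to_review([0.95, 0.75, 0.65, 0.1], first, second))  # [0.75, 0.65]
```

In this example both candidate ranges misclassify exactly one labeled pair, so the labeled sample alone cannot decide between them. Only pairs with similarity values from 0.6 up to (but not including) 0.8 would be classified differently by the two candidates, which is why those are the pairs worth showing to the user.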

49 Citations

30 Claims

1. A method for determining an internal parameter of a data clustering program for clustering data records, comprising:
- inputting user data indicative of a similarity of pairs of data records;
  
  calculating similarity values for the pairs of data records based on a default value of the internal parameter; and
  
  determining a similarity threshold for the similarity values corresponding to the user data.
- - 2. The method of claim 1, further comprising setting the internal parameter as the similarity threshold.
  - 3. The method of claim 2, further comprising storing the optimal similarity threshold.
  - 4. The method of claim 1, wherein determining the similarity threshold comprises determining a first candidate value and a second candidate value.
  - 5. The method of claim 4, further comprising inputting additional user data for pairs of data records having similarity values between the first and second candidate values in order to evaluate the candidate values.
  - 6. The method of claim 1, further comprising determining a first subset of pairs of data records that contains similar data records that have a similarity value which is greater than the similarity threshold.
  - 7. The method of claim 6, further comprising determining a second subset of pairs of data records that contains dissimilar data records that have a similarity value which is greater than the similarity threshold.
  - 8. The method of claim 7, further comprising determining a third subset of pairs of data records that contains similar data records that have a similarity value which is less than the similarity threshold.
  - 9. The method of claim 8, further comprising determining a fourth subset of pairs of data records that contains dissimilar data records that have a similarity value which is less than the similarity threshold.
  - 10. The method of claim 9, further comprising, for each of the first, second, third, and fourth subsets, recalculating the similarity values of the pairs of data records.
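
The four subsets named in claims 6 through 9 correspond to splitting the user-labeled pairs by judgment (similar or dissimilar) and by position relative to the threshold. The following Python sketch shows one way to do that split. The function name, the dictionary keys, and the choice to group values exactly equal to the threshold with the "above" subsets are assumptions, since the claims only speak of "greater than" and "less than". The recalculation step of claim 10 is not covered.

```python
from typing import Dict, List, Tuple

# (similarity value, user says the two records are similar)
LabeledPair = Tuple[float, bool]


def partition_pairs(pairs: List[LabeledPair],
                    threshold: float) -> Dict[str, List[LabeledPair]]:
    """Split labeled pairs into the four subsets of claims 6 to 9."""
    subsets: Dict[str, List[LabeledPair]] = {
        "similar_above": [],     # first subset: similar, above the threshold
        "dissimilar_above": [],  # second subset: dissimilar, above the threshold
        "similar_below": [],     # third subset: similar, below the threshold
        "dissimilar_below": [],  # fourth subset: dissimilar, below the threshold
    }
    for sim, same in pairs:
        # Assumption: a value equal to the threshold counts as "above".
        side = "above" if sim >= threshold else "below"
        label = "similar" if same else "dissimilar"
        subsets[f"{label}_{side}"].append((sim, same))
    return subsets


# Hypothetical user judgments for six record pairs.
sample = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.3, False), (0.2, False)]

parts = partition_pairs(sample, threshold=0.6)
for name, members in parts.items():
    print(name, members)
```

The second and third subsets hold the pairs on which the threshold and the user disagree, so their sizes give a direct measure of how well a given threshold fits the user data.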

11. A computer program product having instruction codes for determining an internal parameter of a data clustering program for clustering data records, comprising:
- a first set of instruction codes for inputting user data indicative of a similarity of pairs of data records;
  
  a second set of instruction codes for calculating similarity values for the pairs of data records based on a default value of the internal parameter; and
  
  a third set of instruction codes for determining a similarity threshold for the similarity values corresponding to the user data.
- - 12. The computer program product of claim 11, further comprising a fourth set of instruction codes for setting the internal parameter as the similarity threshold.
  - 13. The computer program product of claim 12, further comprising a memory for storing the optimal similarity threshold.
  - 14. The computer program product of claim 11, wherein the third set of instruction codes determines a first candidate value and a second candidate value.
  - 15. The computer program product of claim 14, further comprising additional user data for pairs of data records having similarity values between the first and second candidate values.
  - 16. The computer program product of claim 11, further comprising a fifth set of instruction codes for determining a first subset of pairs of data records that contains similar data records that have a similarity value which is greater than the similarity threshold.
  - 17. The computer program product of claim 16, wherein the fifth set of instruction codes determines a second subset of pairs of data records that contains dissimilar data records that have a similarity value which is greater than the similarity threshold.
  - 18. The computer program product of claim 17, wherein the fifth set of instruction codes further determines a third subset of pairs of data records that contains similar data records that have a similarity value which is less than the similarity threshold.
  - 19. The computer program product of claim 18, wherein the fifth set of instruction codes further determines a fourth subset of pairs of data records that contains dissimilar data records that have a similarity value which is less than the similarity threshold.
  - 20. The computer program product of claim 19, wherein the fifth set of instruction codes recalculates the similarity values of the pairs of data records for each of the first, second, third, and fourth subsets.

21. A system for determining an internal parameter of a data clustering program for clustering data records, comprising:
- means for inputting user data indicative of a similarity of pairs of data records;
  
  means for calculating similarity values for the pairs of data records based on a default value of the internal parameter; and
  
  means for determining a similarity threshold for the similarity values corresponding to the user data.
- - 22. The system of claim 21, further comprising means for setting the internal parameter as the similarity threshold.
  - 23. The system of claim 22, further comprising a storage for storing the optimal similarity threshold.
  - 24. The system of claim 21, wherein the means for determining the similarity threshold determines a first candidate value and a second candidate value.
  - 25. The system of claim 24, further comprising additional user data for pairs of data records having similarity values between the first and second candidate values.
  - 26. The system of claim 21, further comprising means for determining a first subset of pairs of data records that contains similar data records that have a similarity value which is greater than the similarity threshold.
  - 27. The system of claim 26, further comprising means for determining a second subset of pairs of data records that contains dissimilar data records that have a similarity value which is greater than the similarity threshold.
  - 28. The system of claim 27, further comprising means for determining a third subset of pairs of data records that contains similar data records that have a similarity value which is less than the similarity threshold.
  - 29. The system of claim 28, further comprising means for determining a fourth subset of pairs of data records that contains dissimilar data records that have a similarity value which is less than the similarity threshold.
  - 30. The system of claim 29, wherein the means for determining the first, second, third, and fourth subsets recalculates the similarity values of the pairs of data records for each of the first, second, third, and fourth subsets.

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Charpiot, Boris; Hartel, Barbara; Maier, Thilo; Lingenfelder, Christoph

Granted Patent

US 7,177,863 B2
US Class Current

707/1
CPC Class Codes

G06F 16/35   Clustering; Classification

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99937   Sorting

Y10S 707/99944   Object-oriented database st...
