Apparatus and methods for scalable object clustering
Abstract
One embodiment relates to an apparatus configured to efficiently group a set of strings into clusters of related strings. Data storage is configured to store computer-readable code and data, and a processor is configured to access the data storage and to execute said computer-readable code. Computer-readable code is configured to receive the set of strings, determine an evaluation function between pairs of strings in said set, and group the strings into clusters, wherein determining the evaluation function between pairs of strings utilizes hash tables. Another embodiment relates to a computer-implemented method of efficiently grouping a set of strings into clusters of related strings based on rules of inference. Other embodiments and features are also disclosed.
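The hash-table-based evaluation function is spelled out in claim 1 as four steps: (i) build a hash table from the first string, (ii) match sub-strings of the second string against it, (iii) record the matches in a list, and (iv) apply a threshold based on the length of the common substrings. The following is a minimal illustrative sketch of those steps, not the patented implementation. The function name `related`, the k-gram indexing scheme, the greedy match extension, and the parameters `k` and `min_common` are all assumptions introduced here for illustration; the source does not specify them.

```python
from collections import defaultdict


def related(first: str, second: str, k: int = 4, min_common: int = 8) -> bool:
    """Binary evaluation: True if the two strings share enough common substring length.

    k and min_common are hypothetical tuning parameters, not values from the source.
    """
    if len(first) < k or len(second) < k:
        return False

    # (i) Hash table built from the first string: k-character substring -> start offsets.
    table = defaultdict(list)
    for i in range(len(first) - k + 1):
        table[first[i:i + k]].append(i)

    # (ii) Match sub-strings of the second string against the first via the table,
    # (iii) recording each match as (offset_in_first, offset_in_second, length).
    matches = []
    j = 0
    while j <= len(second) - k:
        best_len, best_i = 0, -1
        for i in table.get(second[j:j + k], ()):
            n = k
            # Extend the k-character seed match as far as both strings agree.
            while (i + n < len(first) and j + n < len(second)
                   and first[i + n] == second[j + n]):
                n += 1
            if n > best_len:
                best_len, best_i = n, i
        if best_len:
            matches.append((best_i, j, best_len))
            j += best_len  # continue after the matched region
        else:
            j += 1

    # (iv) Threshold on the total length of the recorded common substrings.
    common = sum(length for _, _, length in matches)
    return common >= min_common


print(related("the quick brown fox", "a quick brown dog"))
print(related("abcdefgh", "zzzzzzzz"))
```

Indexing only the first string keeps the table small, and each lookup in the second string is a constant-time hash probe on average, which is one plausible reading of why the claims describe the approach as efficient.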
19 Claims
1. An apparatus configured to efficiently group a set of strings into clusters of related strings, the apparatus comprising:
data storage configured to store computer-readable code and data;
a processor configured to access the data storage and to execute said computer-readable code;
computer-readable code configured to receive the set of strings;
computer-readable code configured to determine a binary output of an evaluation function between a pair of strings by steps including (i) generating a hash table based on a first string, (ii) matching sub-strings of a second string against the first string using the hash table, (iii) recording matches in a list, and (iv) applying a threshold based at least in part on a length of common substrings between the first and second strings; and
computer-readable code configured to group the strings in the set into clusters by a procedure which, for each string that does not already belong to a cluster, determines the binary output of the evaluation function between the string and each other string in the set that do not yet belong to any cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer-implemented method of efficiently grouping a set of strings into clusters of related strings, the method comprising:
receiving the set of strings;
determining a binary output of an evaluation function between a pair of strings by steps including (i) generating a hash table based on a first string, (ii) matching sub-strings of a second string against the first string using the hash table, (iii) recording matches in a list, and (iv) applying a threshold based at least in part on a length of common substrings between the first and second strings; and
grouping the strings into clusters by a procedure which, for each string in said set that does not already belong to a cluster, determines the binary output of the evaluation function between the string and other strings in the set that do not yet belong to any cluster.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
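The grouping step recited in the independent claims can be read as a single greedy pass: each string that is not yet in a cluster is compared, via the binary evaluation function, against the other strings that are also not yet in a cluster. The sketch below illustrates one such reading. It assumes that a string for which the evaluation returns true joins the cluster of the string it was compared against, which the claim language does not state explicitly. The names `cluster` and `share_substring` are hypothetical, and `share_substring` is a simplified stand-in for the evaluation function, not the one described in the claims.

```python
from typing import Callable, List, Sequence


def share_substring(a: str, b: str, k: int = 5) -> bool:
    """Hypothetical stand-in evaluation: True if a and b share any k-character substring."""
    grams = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in grams for j in range(len(b) - k + 1))


def cluster(strings: Sequence[str],
            evaluate: Callable[[str, str], bool] = share_substring) -> List[List[str]]:
    """Greedy single-pass grouping of strings into clusters."""
    assigned = [False] * len(strings)
    clusters: List[List[str]] = []
    for i, seed in enumerate(strings):
        if assigned[i]:
            continue  # already belongs to a cluster; skip it
        assigned[i] = True
        members = [seed]
        # Every index below i is already assigned, so only later strings
        # can still be unclustered.
        for j in range(i + 1, len(strings)):
            if not assigned[j] and evaluate(seed, strings[j]):
                assigned[j] = True  # assumption: a true evaluation joins the seed's cluster
                members.append(strings[j])
        clusters.append(members)
    return clusters


groups = cluster(["invoice-2021-a", "invoice-2021-b", "report-q3", "report-q4", "misc"])
print(groups)
```

Because clustered strings are skipped in later passes, the number of pairwise evaluations shrinks as clusters form, rather than always covering all pairs in the set.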
Specification