Data clustering based on variant token networks
First Claim
1. A method, including:
- receiving data records, the received data records each including one or more values in one or more fields; and
- processing the received data records to identify one or more data clusters of two or more data records, where the data clusters are identified based on candidate data records that are identified based on a network representing identified tokens, the processing including:
  - identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records;
  - generating the network representing the identified tokens, with nodes of the network representing individual tokens and edges of the network each representing a variant relationship between tokens;
  - identifying, for each received data record to be associated with a data cluster, a corresponding set of candidate data records, such that candidate data records that are in the same set each include one or more tokens from the same group of tokens represented by a subset of connected nodes in the generated network; and
  - for at least one candidate data record in the set of candidate data records corresponding to a received data record, determining whether or not the received data record satisfies a cluster association criterion for a candidate data cluster to which the candidate data record belongs.
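The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything concrete in it is an assumption made for illustration: the whitespace tokenizer, the variant test (equal length, one differing character), the Jaccard overlap used as the cluster association criterion, the 0.5 threshold, and all function names.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(record):
    """Assumed tokenizer: lowercase whitespace-delimited words from every field."""
    return {word for value in record.values() for word in str(value).lower().split()}


def is_variant(a, b):
    """Assumed variant relationship: equal length, exactly one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def build_network(tokens):
    """Nodes are individual tokens; an edge joins each pair of variant tokens."""
    network = {t: set() for t in tokens}
    for a, b in combinations(sorted(tokens), 2):
        if is_variant(a, b):
            network[a].add(b)
            network[b].add(a)
    return network


def token_groups(network):
    """Map each token to the id of its connected component (a group of tokens)."""
    group_of = {}
    for start in sorted(network):
        if start in group_of:
            continue
        group_of[start] = start
        stack = [start]
        while stack:  # depth-first walk over connected nodes
            node = stack.pop()
            for neighbor in network[node]:
                if neighbor not in group_of:
                    group_of[neighbor] = start
                    stack.append(neighbor)
    return group_of


def cluster(records, threshold=0.5):
    """Return one cluster id per record; a new cluster takes its first record's index."""
    token_sets = [tokenize(r) for r in records]
    group_of = token_groups(build_network(set().union(*token_sets)))
    group_sets = [{group_of[t] for t in ts} for ts in token_sets]

    members_by_group = defaultdict(list)  # token group -> indices of earlier records
    cluster_of = []
    for i, groups in enumerate(group_sets):
        # Candidates: earlier records sharing at least one token group with record i.
        candidates = sorted({j for g in groups for j in members_by_group[g]})
        assigned = None
        for j in candidates:
            # Assumed association criterion: Jaccard overlap of token groups.
            overlap = len(groups & group_sets[j]) / len(groups | group_sets[j])
            if overlap >= threshold:
                assigned = cluster_of[j]
                break
        cluster_of.append(i if assigned is None else assigned)
        for g in groups:
            members_by_group[g].append(i)
    return cluster_of


records = [
    {"name": "john smith", "city": "boston"},
    {"name": "john smyth", "city": "boston"},
    {"name": "mary jones", "city": "denver"},
]
print(cluster(records))
```

In this example "smith" and "smyth" are connected in the network, so the first two records fall into the same token groups and share a cluster, while the third record shares no token group with them and starts its own.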
Abstract
Received data records, each including one or more values in one or more fields, are processed to identify one or more data clusters. The processing includes: identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields; generating a network representing the identified tokens, with nodes of the network representing tokens and edges of the network each representing a variant relationship between tokens; and generating a graphical representation of the network with different subsets of nodes distinguished based at least in part on values associated with nodes, where a value associated with a particular node quantifies a count of a number of instances of the token represented by that particular node appearing within the received data records.
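As a rough illustration of the graphical representation described in the abstract, the sketch below counts token instances and emits Graphviz DOT text in which nodes are distinguished by their counts. The tokenizer, the variant test, the `high_count` threshold, the fill-based styling, and the function names are all assumptions for illustration, and tokens are assumed to contain no double quotes.

```python
from collections import Counter
from itertools import combinations


def token_counts(records):
    """Count instances of each whitespace-delimited token across all fields."""
    counts = Counter()
    for record in records:
        for value in record.values():
            counts.update(str(value).lower().split())
    return counts


def is_variant(a, b):
    """Assumed variant relationship: equal length, exactly one differing character."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def to_dot(counts, high_count=2):
    """Render the variant network as DOT text; frequent tokens are drawn filled."""
    lines = ["graph variants {"]
    for token in sorted(counts):
        style = "filled" if counts[token] >= high_count else "solid"
        lines.append(f'  "{token}" [label="{token} ({counts[token]})", style={style}];')
    for a, b in combinations(sorted(counts), 2):
        if is_variant(a, b):
            lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)


sample = [{"name": "smith"}, {"name": "smith"}, {"name": "smyth"}]
print(to_dot(token_counts(sample)))
```

Here the value attached to each node is the token's instance count, and the filled versus solid style separates the two subsets of nodes; any other count-based visual encoding would serve the same purpose.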
102 Citations
60 Claims
1. A method, including:
- receiving data records, the received data records each including one or more values in one or more fields; and
- processing the received data records to identify one or more data clusters of two or more data records, where the data clusters are identified based on candidate data records that are identified based on a network representing identified tokens, the processing including:
  - identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records;
  - generating the network representing the identified tokens, with nodes of the network representing individual tokens and edges of the network each representing a variant relationship between tokens;
  - identifying, for each received data record to be associated with a data cluster, a corresponding set of candidate data records, such that candidate data records that are in the same set each include one or more tokens from the same group of tokens represented by a subset of connected nodes in the generated network; and
  - for at least one candidate data record in the set of candidate data records corresponding to a received data record, determining whether or not the received data record satisfies a cluster association criterion for a candidate data cluster to which the candidate data record belongs.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

16. A computer program stored on a non-transitory computer-readable medium, the computer program including instructions for causing a computing system to:
- receive data records, the received data records each including one or more values in one or more fields; and
- process the received data records to identify one or more data clusters of two or more data records, where the data clusters are identified based on candidate data records that are identified based on a network representing identified tokens, the processing including:
  - identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records;
  - generating the network representing the identified tokens, with nodes of the network representing individual tokens and edges of the network each representing a variant relationship between tokens;
  - identifying, for each received data record to be associated with a data cluster, a corresponding set of candidate data records, such that candidate data records that are in the same set each include one or more tokens from the same group of tokens represented by a subset of connected nodes in the generated network; and
  - for at least one candidate data record in the set of candidate data records corresponding to a received data record, determining whether or not the received data record satisfies a cluster association criterion for a candidate data cluster to which the candidate data record belongs.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32

17. A computing system, including:
- an input device or port configured to receive data records, the received data records each including one or more values in one or more fields; and
- at least one processor configured to process the received data records to identify one or more data clusters of two or more data records, where the data clusters are identified based on candidate data records that are identified based on a network representing identified tokens, the processing including:
  - identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records;
  - generating the network representing the identified tokens, with nodes of the network representing individual tokens and edges of the network each representing a variant relationship between tokens;
  - identifying, for each received data record to be associated with a data cluster, a corresponding set of candidate data records, such that candidate data records that are in the same set each include one or more tokens from the same group of tokens represented by a subset of connected nodes in the generated network; and
  - for at least one candidate data record in the set of candidate data records corresponding to a received data record, determining whether or not the received data record satisfies a cluster association criterion for a candidate data cluster to which the candidate data record belongs.
Dependent claims: 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46

18. A computing system, including:
- means for receiving data records, the received data records each including one or more values in one or more fields; and
- means for processing the received data records to identify one or more data clusters of two or more data records, where the data clusters are identified based on candidate data records that are identified based on a network representing identified tokens, the processing including:
  - identifying tokens that each include at least one value or fragment of a value in a field or a combination of fields of the received data records;
  - generating the network representing the identified tokens, with nodes of the network representing individual tokens and edges of the network each representing a variant relationship between tokens;
  - identifying, for each received data record to be associated with a data cluster, a corresponding set of candidate data records, such that candidate data records that are in the same set each include one or more tokens from the same group of tokens represented by a subset of connected nodes in the generated network; and
  - for at least one candidate data record in the set of candidate data records corresponding to a received data record, determining whether or not the received data record satisfies a cluster association criterion for a candidate data cluster to which the candidate data record belongs.
Dependent claims: 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60

Specification