Managing an archive for approximate string matching
First Claim
1. A method for managing an archive for determining approximate matches associated with strings occurring in records, the method including:
processing records to determine a set of string representations that correspond to strings occurring in the records;
for each of at least some of the string representations in the set, generating a plurality of close representations for that string representation, wherein each of the plurality of close representations is generated from at least some of the same characters in at least one of the strings occurring in at least one of the records processed to determine that string representation;
comparing first close representations that are each generated from at least some characters in at least a first one of the strings occurring in at least one of the records processed to determine a first one of the string representations to second close representations that are each generated from at least some characters in at least a second one of the strings occurring in at least one of the records processed to determine a second one of the string representations, wherein the first close representations are for the first one of the string representations, and wherein the second close representations are for the second one of the string representations;
identifying which one of the first close representations that are each generated from at least some characters in at least the first one of the strings occurring in at least one of the records processed to determine the first one of the string representations corresponds to which one of the second close representations that are each generated from at least some characters in at least the second one of the strings occurring in at least one of the records processed to determine the second one of the string representations; and
based on identified correspondences between close representations, storing entries in an archive that each represent a potential approximate match between at least two strings based on their respective close representations.
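The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes, purely for illustration, that a string representation is a lowercased word, that a close representation is the word itself or the word with one character deleted, and that two close representations "correspond" when they are identical. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def string_representations(records):
    """Collect the distinct lowercased words occurring in the records.

    Assumption: a "string representation" is simply a lowercased word.
    """
    reps = set()
    for record in records:
        for field in record:
            for word in field.split():
                reps.add(word.lower())
    return reps


def close_representations(word):
    """Return the word itself plus every variant with one character deleted.

    Assumption: single-character deletion is the chosen way of generating
    close representations from "at least some of the same characters".
    """
    variants = {word}
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])
    return variants


def build_archive(records):
    """Map each pair of strings to the close representations they share.

    Each entry stands for a potential approximate match, not a confirmed one.
    """
    # Group strings by close representation; strings in the same group
    # have corresponding (here: identical) close representations.
    by_variant = defaultdict(set)
    for word in string_representations(records):
        for variant in close_representations(word):
            by_variant[variant].add(word)

    archive = defaultdict(set)
    for variant, words in by_variant.items():
        for a, b in combinations(sorted(words), 2):
            archive[(a, b)].add(variant)
    return dict(archive)


records = [("John Smith",), ("Jon Smyth",), ("Jane Smithe",)]
archive = build_archive(records)
for pair, shared in sorted(archive.items()):
    print(pair, sorted(shared))
```

Under these assumptions the archive pairs "john" with "jon", "smith" with "smyth", and "smith" with "smithe", while "jane" is paired with nothing. Grouping by close representation avoids comparing every string against every other string; a later scoring step would still be needed to decide which potential matches are acceptable.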
Abstract
In one aspect, in general, a method is described for managing an archive for determining approximate matches associated with strings occurring in records. The method includes: processing records to determine a set of string representations that correspond to strings occurring in the records; generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string; and storing entries in the archive that each represent a potential approximate match between at least two strings based on their respective close representations.
91 Citations
25 Claims
Dependent claims of claim 1: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 25
22. A computer program, stored on a non-transitory computer-readable medium, for managing an archive for determining approximate matches associated with strings occurring in records, the computer program including instructions for causing a computer to:
process records to determine a set of string representations that correspond to strings occurring in the records;
for each of at least some of the string representations in the set, generate a plurality of close representations for that string representation, wherein each of the plurality of close representations is generated from at least some of the same characters in at least one of the strings occurring in at least one of the records processed to determine that string representation;
compare first close representations that are each generated from at least some characters in at least a first one of the strings occurring in at least one of the records processed to determine a first one of the string representations to second close representations that are each generated from at least some characters in at least a second one of the strings occurring in at least one of the records processed to determine a second one of the string representations, wherein the first close representations are for the first one of the string representations, and wherein the second close representations are for the second one of the string representations;
identify which one of the first close representations that are each generated from at least some characters in at least the first one of the strings occurring in at least one of the records processed to determine the first one of the string representations corresponds to which one of the second close representations that are each generated from at least some characters in at least the second one of the strings occurring in at least one of the records processed to determine the second one of the string representations; and
based on identified correspondences between close representations, store entries in an archive that each represent a potential approximate match between at least two strings based on their respective close representations.
23. A system for managing an archive for determining approximate matches associated with strings occurring in records, the system including:
means for processing records to determine a set of string representations that correspond to strings occurring in the records;
for each of at least some of the string representations in the set, means for generating a plurality of close representations for that string representation, wherein each of the plurality of close representations is generated from at least some of the same characters in at least one of the strings occurring in at least one of the records processed to determine that string representation;
means for comparing first close representations that are each generated from at least some characters in at least a first one of the strings occurring in at least one of the records processed to determine a first one of the string representations to second close representations that are each generated from at least some characters in at least a second one of the strings occurring in at least one of the records processed to determine a second one of the string representations, wherein the first close representations are for the first one of the string representations, and wherein the second close representations are for the second one of the string representations;
means for identifying which one of the first close representations that are each generated from at least some characters in at least the first one of the strings occurring in at least one of the records processed to determine the first one of the string representations corresponds to which one of the second close representations that are each generated from at least some characters in at least the second one of the strings occurring in at least one of the records processed to determine the second one of the string representations; and
based on identified correspondences between close representations, means for storing entries in an archive that each represent a potential approximate match between at least two strings based on their respective close representations.
24. A system for managing an archive for determining approximate matches associated with strings occurring in records, the system including:
a data source storing records;
a computer system configured to:
process records to determine a set of string representations that correspond to strings occurring in the records;
for each of at least some of the string representations in the set, generate a plurality of close representations for that string representation, wherein each of the plurality of close representations is generated from at least some of the same characters in at least one of the strings occurring in at least one of the records processed to determine that string representation;
compare first close representations that are each generated from at least some characters in at least a first one of the strings occurring in at least one of the records processed to determine a first one of the string representations to second close representations that are each generated from at least some characters in at least a second one of the strings occurring in at least one of the records processed to determine a second one of the string representations, wherein the first close representations are for the first one of the string representations, and wherein the second close representations are for the second one of the string representations;
identify which one of the first close representations that are each generated from at least some characters in at least the first one of the strings occurring in at least one of the records processed to determine the first one of the string representations corresponds to which one of the second close representations that are each generated from at least some characters in at least the second one of the strings occurring in at least one of the records processed to determine the second one of the string representations; and
based on identified correspondences between close representations, store entries in an archive that each represent a potential approximate match between at least two strings based on their respective close representations.
Specification