Managing an archive for approximate string matching
Abstract
In one aspect, in general, a method is described for managing an archive. The archive is used for determining approximate matches associated with strings occurring in records. The method includes processing records to determine a set of string representations that correspond to strings occurring in the records. The method also includes generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string. The method also includes storing entries in the archive. Each stored entry represents a potential approximate match between at least two strings based on their respective close representations.
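The independent claims narrow the abstract's "close representations" to deletion variants, so a small sketch may help show how shared deletion variants surface candidate matches. The function names `deletion_variants` and `candidate_pairs` and the `max_deletions` parameter are illustrative assumptions, not terms from the patent, and the code is one possible reading rather than the patented implementation.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(s, max_deletions=1):
    """Return the strings obtained by deleting 1..max_deletions characters from s.

    Assumption: empty variants are skipped so that very short strings
    do not all collide on the empty string.
    """
    variants = set()
    for k in range(1, min(max_deletions, len(s) - 1) + 1):
        # Choose which positions to keep; the rest are deleted.
        for keep in combinations(range(len(s)), len(s) - k):
            variants.add("".join(s[i] for i in keep))
    return variants


def candidate_pairs(strings, max_deletions=1):
    """Pair up distinct strings that share at least one deletion variant."""
    index = defaultdict(set)  # deletion variant -> strings that produce it
    for s in set(strings):
        for v in deletion_variants(s, max_deletions):
            index[v].add(s)
    pairs = set()
    for bucket in index.values():
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # "smith" and "smyth" both reduce to "smth" after one deletion.
    print(candidate_pairs(["smith", "smyth", "jones"]))
```

With exactly one deletion on each side, this sketch only pairs strings of equal length (for example, a single substituted character). Pairing strings whose lengths differ would need a larger `max_deletions` or some other extension, which is left as an assumption here.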
45 Claims
1. A method for managing an archive for determining approximate matches associated with strings occurring in records, the method including:
- determining a set of strings occurring in the records, the set of strings including a first string;
- generating, for each of the strings in the set, a plurality of deletion variants that are each generated by deleting one or more characters from the corresponding string;
- for the first string, identifying one or more potentially matching strings in the set of strings, each potentially matching string of the potentially matching strings identified in response to determining that any deletion variant of the first string matches any deletion variant of the potentially matching string;
- for each of the potentially matching strings, calculating a corresponding match score;
- for at least some of the potentially matching strings, storing a record in the archive identifying the first string, the potentially matching string, and the match score;
- determining a count of occurrences of the first string in the records;
- for each of the potentially matching strings, determining a count of occurrences of the respective potentially matching string in the records; and
- generating a significance value for the first string based on a sum of at least the count of occurrences of the string and the count of occurrences of each of the one or more potentially matching strings.
Dependent claims: 2-21.
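The scoring, archiving, and significance steps of claim 1 can be sketched as follows. The claim does not say how the match score is computed, so this sketch assumes Python's `difflib.SequenceMatcher` similarity ratio as a stand-in. The function name `build_archive`, the `min_score` threshold, and the dictionary keys are hypothetical, and the significance value is taken to be the plain sum of counts, which is only one reading of "based on a sum of at least".

```python
from collections import Counter
from difflib import SequenceMatcher


def build_archive(first, candidates, occurrences, min_score=0.0):
    """Score candidates against `first`, archive some of them, and compute significance.

    first       -- the first string
    candidates  -- potentially matching strings already identified for `first`
    occurrences -- every string occurrence found in the records (with repeats)
    min_score   -- hypothetical cutoff deciding which candidates get archived
    """
    counts = Counter(occurrences)  # occurrences of each string in the records
    archive = []
    for cand in candidates:
        # Assumed scoring function; the claim leaves the score unspecified.
        score = SequenceMatcher(None, first, cand).ratio()
        if score >= min_score:
            archive.append({"string": first, "match": cand, "score": score})
    # Sum of the first string's count and every candidate's count,
    # whether or not the candidate was archived.
    significance = counts[first] + sum(counts[c] for c in candidates)
    return archive, significance


if __name__ == "__main__":
    records = ["smith", "smith", "smyth", "jones"]
    entries, significance = build_archive("smith", ["smyth"], records)
    print(entries, significance)
```

Summing over all candidates, not only the archived ones, follows the claim wording, which counts "each of the one or more potentially matching strings".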
22. A computer program, stored on a non-transitory computer-readable medium, for managing an archive for determining approximate matches associated with strings occurring in records, the computer program including instructions for causing a computer to:
- determine a set of strings occurring in the records, the set of strings including a first string;
- generate, for each of the strings in the set, a plurality of deletion variants that are each generated by deleting one or more characters from the corresponding string;
- for the first string, identify one or more potentially matching strings in the set of strings, each potentially matching string of the potentially matching strings identified in response to determining that any deletion variant of the first string matches any deletion variant of the potentially matching string;
- for each of the potentially matching strings, calculate a corresponding match score;
- for at least some of the potentially matching strings, store a record in the archive identifying the first string, the potentially matching string, and the match score;
- determine a count of occurrences of the first string in the records;
- for each of the potentially matching strings, determine a count of occurrences of the respective potentially matching string in the records; and
- generate a significance value for the first string based on a sum of at least the count of occurrences of the string and the count of occurrences of each of the one or more potentially matching strings.
Dependent claims: 23-29.
30. A system for managing an archive for determining approximate matches associated with strings occurring in records, the system including:
- means for determining a set of strings occurring in the records, the set of strings including a first string;
- means for generating, for each of the strings in the set, a plurality of deletion variants that are each generated by deleting one or more characters from the corresponding string;
- means for identifying, for the first string, one or more potentially matching strings in the set of strings, each potentially matching string of the potentially matching strings identified in response to determining that any deletion variant of the first string matches any deletion variant of the potentially matching string;
- means for calculating, for each of the potentially matching strings, a corresponding match score;
- means for storing, for at least some of the potentially matching strings, a record in the archive identifying the first string, the potentially matching string, and the match score;
- means for determining a count of occurrences of the first string in the records;
- means for determining, for each of the potentially matching strings, a count of occurrences of the respective potentially matching string in the records; and
- means for generating a significance value for the first string based on a sum of at least the count of occurrences of the string and the count of occurrences of each of the one or more potentially matching strings.
Dependent claims: 31-37.
38. A system for managing an archive for determining approximate matches associated with strings occurring in records, the system including:
- a data source storing records;
- a computer system configured to
  - determine a set of strings occurring in the records, the set of strings including a first string;
  - generate, for each of the strings in the set, a plurality of deletion variants that are each generated by deleting one or more characters from the corresponding string;
  - for the first string, identify one or more potentially matching strings in the set of strings, each potentially matching string of the potentially matching strings identified in response to determining that any deletion variant of the first string matches any deletion variant of the potentially matching string;
  - for each of the potentially matching strings, calculate a corresponding match score;
  - for at least some of the potentially matching strings, store a record in the archive identifying the first string, the potentially matching string, and the match score;
  - determine a count of occurrences of the first string in the records;
  - determine, for each of the potentially matching strings, a count of occurrences of the respective potentially matching string in the records; and
  - generate a significance value for the first string based on a sum of at least the count of occurrences of the string and the count of occurrences of each of the one or more potentially matching strings; and
- a data store coupled to the computer system to store an archive including entries that each represent a potential approximate match between at least two strings.
Dependent claims: 39-45.
Specification