Systems and methods for manipulation of inexact semi-structured data
First Claim
1. A method for reducing a set of strings to approximately match to a first string by determining an edit distance between the first string and the set of strings is within a predetermined threshold, the method comprising:
- (a) receiving, by a device, a request to approximately match a first string with a set of strings using a predetermined edit distance;
(b) generating, by a device, a difference histogram comprising a distribution of a difference in a first number of occurrences of each character of a character set in the first string of the request and a second number of occurrences of each character of the character set in a second string of the set of strings, by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value;
(c) determining, by a device, via the difference histogram that a first sum of values across a plurality of cells of the difference histogram is greater than a predetermined threshold and that a second sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold; and
(d) identifying, by the device, the second string as having an edit distance from the first string greater than the predetermined edit distance in response to the determination.
0 Assignments
0 Petitions
Accused Products
Abstract
The data constraint framework solution of the present invention addresses data quality issues by standardizing, verifying, matching, consolidating and merging data records using powerful inexact matching logic and search reduction technologies. The data conditioning framework uses these technologies to more efficiently condition data to improve the quality of data and/or resolve quality data issues such as incomplete, inaccurate and duplicate data records. For example, the data conditioning framework is used to “cleanse” incorrect, incomplete and duplicate data from a data source, such as an information system. The data conditioning framework uses the following approximate searching and matching techniques to improve the efficiency of the approximate matching, reduce the search space for approximate matching, and improve the speed of executing approximate searches and matches: 1) inexact trimmed matching, 2) adaptive search ordering, 3) cascading search space reduction, 4) tiered and metric indexing, and 5) domain knowledge matching.
24 Citations
19 Claims
-
1. A method for reducing a set of strings to approximately match to a first string by determining an edit distance between the first string and the set of strings is within a predetermined threshold, the method comprising:
-
(a) receiving, by a device, a request to approximately match a first string with a set of strings using a predetermined edit distance; (b) generating, by a device, a difference histogram comprising a distribution of a difference in a first number of occurrences of each character of a character set in the first string of the request and a second number of occurrences of each character of the character set in a second string of the set of strings, by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value; (c) determining, by a device, via the difference histogram that a first sum of values across a plurality of cells of the difference histogram is greater than a predetermined threshold and that a second sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold; and (d) identifying, by the device, the second string as having an edit distance from the first string greater than the predetermined edit distance in response to the determination. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
-
-
10. A system for reducing a set of strings to approximately match to a first string by determining an edit distance between the first string and the set of strings is within a predetermined threshold, the system comprising:
-
a device receiving a request to approximately match a first string with a set of strings using a predetermined edit distance; an approximate matching engine executing on a processor of the device, generating a difference histogram comprising a distribution of a difference in a first number of occurrences of each character of a character set in the first string of the request and a second number of occurrences of each character of the character set in a second string of the set of strings, by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value; and and wherein the approximate matching engine identifies the second string as having an edit distance from the first string greater than the predetermined edit distance in response to determining that a first sum of values across a plurality of cells of the difference histogram is greater than a predetermined threshold and that a second sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A method for reducing a set of strings to approximately match to a string, the method comprising:
-
(a) receiving, by a device, a request to approximately match a first string to a plurality of strings; (b) determining, by the device, a first number of occurrences of each character of a character set in the first string; (c) determining, by the device, a second number of occurrences of each character of the character set in a second string of the plurality of strings; (d) generating, by the device based on the first number of occurrences minus the second number of occurrences for each character in the character set, a difference histogram comprising a difference in occurrence of each character in the character set between the first string and the second string by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value; and (e) skipping, by the device approximately matching the first string to the second string, responsive to determining via the difference histogram that a first sum of values across a plurality of cells of the difference histogram is less than a predetermined threshold and that a sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold.
-
Specification