Supplier deduplication engine
First Claim
1. A method for deduplication and grouping similar supplier names from a plurality of supplier names, comprising:
correcting syntactical errors in said supplier names;
grouping the supplier names after said step of correcting syntactical errors;
capturing abbreviations of the supplier names;
correcting ordering, pronunciation and stemming errors in the supplier names;
calculating a name matching score between two of said supplier names using a matching algorithm, comprising the steps of;
grouping supplier names based on the first set of characters in the supplier names;
calculating a word matching score between corresponding words in two of said supplier names, comprising;
determining stems of said corresponding words;
determining sound codes of said determined stems using a modified metaphone algorithm;
determining a Levenshtein distance between said sound codes;
calculating a prefix score using said stems and calculating a sound score using said sound codes;
calculating a Levenshtein distance score using said determined Levenshtein distance and length of larger of said corresponding words;
selecting one of said prefix score, said sound score and said Levenshtein distance score as said word matching score based on comparisons with a set threshold;
calculating said name matching score between two of said supplier names based on said word matching score; and
comparing said name matching score with a threshold value to determine a match, and grouping said supplier names based on said match.
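The word-level scoring steps recited above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the claim gives no formulas, so the toy `stem` suffix stripper, the crude `sound_code` (a stand-in, not the modified metaphone algorithm), the three score formulas, the 0.8 threshold, and the order in which the scores are tried are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def stem(word: str) -> str:
    """Toy stemmer (assumption): strip one common suffix, keep at least 3 letters."""
    w = word.lower()
    for suffix in ("ing", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def sound_code(s: str) -> str:
    """Placeholder phonetic code (assumption, NOT modified metaphone):
    keep the first letter, drop later vowels and Y, collapse repeated letters."""
    s = s.upper()
    if not s:
        return ""
    code = s[0]
    for ch in s[1:]:
        if ch in "AEIOUY":
            continue
        if ch != code[-1]:
            code += ch
    return code


def prefix_score(s1: str, s2: str) -> float:
    """Assumed formula: shared prefix length divided by the longer stem length."""
    common = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            break
        common += 1
    longest = max(len(s1), len(s2))
    return common / longest if longest else 0.0


def word_score(w1: str, w2: str, threshold: float = 0.8) -> float:
    """Pick one of the three scores by comparing against a threshold (assumed order)."""
    s1, s2 = stem(w1), stem(w2)
    c1, c2 = sound_code(s1), sound_code(s2)
    p = prefix_score(s1, s2)
    if p >= threshold:
        return p
    sound = 1.0 if c1 == c2 else 0.0  # assumed: exact sound-code agreement
    if sound >= threshold:
        return sound
    # Levenshtein distance between sound codes, scaled by the longer word.
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 0.0
    return max(0.0, 1.0 - levenshtein(c1, c2) / longest)
```

With these assumptions, `word_score("smith", "smyth")` falls through the prefix test and returns the sound score, because both stems reduce to the same placeholder code. A name-level score could then be some aggregate of the word scores, which the claim likewise leaves unspecified.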
Abstract
Disclosed herein is a method of grouping similar supplier names together in a database. Syntactical errors in the supplier names are corrected, and the supplier names are grouped after the correction. Abbreviations in the supplier names are captured. Ordering, pronunciation and stemming errors in the supplier names are corrected. A matching algorithm that compares two supplier names is then applied. It comprises the steps of grouping supplier names based on a first set of characters in the supplier names and calculating a matching score between the two supplier names using the Levenshtein distance between them, along with the supplier names' sound codes obtained from a modified metaphone algorithm, the length of each word, the position of matching and mismatching characters, and the stems of words in the supplier names. The matching scores are compared with set thresholds in order to further group the supplier names into clusters.
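As a rough illustration of the first two steps in the abstract (correcting syntactical errors and capturing abbreviations), the following Python sketch may help. The abstract does not say how either step is performed, so everything here is an assumption: `normalize` treats "syntactical errors" as case, punctuation and spacing differences, and `is_abbreviation_of` treats an abbreviation as the initials of the full name. All function names are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Assumed syntactic clean-up: lowercase, replace punctuation with spaces,
    and collapse runs of whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def acronym(name: str) -> str:
    """Initial letters of each word in the normalized name."""
    return "".join(word[0] for word in normalize(name).split())


def is_abbreviation_of(short: str, full: str) -> bool:
    """Assumed rule: `short` matches when its letters equal the initials of `full`."""
    return normalize(short).replace(" ", "") == acronym(full)


# Example with made-up supplier names.
print(normalize("  ACME,  Inc. "))
print(is_abbreviation_of("I.B.M.", "International Business Machines"))
```

A production system would more likely rely on a curated abbreviation table (for example mapping "Inc" to "Incorporated"); the initials rule is only the simplest case.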
18 Claims
18. A computer program product comprising computer executable instructions embodied in a non-transitory computer-readable medium, said computer program product including:
a first computer parsable program code for correcting syntactical errors in supplier names;
a second computer parsable program code for grouping the supplier names after said step of correcting syntactical errors;
a third computer parsable program code for capturing abbreviations of the supplier names;
a fourth computer parsable program code for correcting ordering, pronunciation and stemming errors in the supplier names;
a fifth computer parsable program code of a matching algorithm to calculate a name matching score between two of said supplier names, comprising;
a sixth computer parsable program code for grouping supplier names based on the first set of characters in the supplier names to avoid unnecessary squared n matching;
a seventh computer parsable program code for calculating a word matching score between corresponding words in two of said supplier names, comprising;
an eighth computer parsable program code for determining stems of said corresponding words;
a ninth computer parsable program code for determining sound codes of said determined stems using;
a tenth computer parsable program code for determining a Levenshtein distance between said sound codes;
an eleventh computer parsable program code for calculating a prefix score using said stems and calculating a sound score using said sound codes;
a twelfth computer parsable program code for calculating a Levenshtein distance score using said determined Levenshtein distance and length of larger of said corresponding words;
a thirteenth computer parsable program code for selecting one of said prefix score, said sound score and said Levenshtein distance score as said word matching score based on comparisons with a set threshold;
a fourteenth computer parsable program code for calculating said name matching score between two of said supplier names based on said word matching score; and
a fifteenth computer parsable program code for comparing said name matching score with a threshold value to determine a match, and grouping said supplier names based on said match.
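The claim's point about grouping by the first set of characters "to avoid unnecessary squared n matching" can be illustrated with a short Python sketch. It only compares names that share a leading-character key, then merges names whose score reaches a threshold. The key length `k`, the union-find merging, and the pluggable `score` function are assumptions made for illustration; the claim does not specify them.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names, k=2):
    """Yield only pairs whose first k characters agree (case-insensitive),
    instead of all n * (n - 1) / 2 pairs."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name.lower()[:k]].append(name)
    for members in blocks.values():
        yield from combinations(members, 2)


def cluster(names, score, threshold, k=2):
    """Group names whose pairwise score meets the threshold.
    Assumes the names are distinct strings; uses a small union-find."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in candidate_pairs(names, k):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return list(groups.values())


# Made-up supplier names and a deliberately trivial score (same first word).
names = ["acme inc", "acme incorporated", "acorn ltd", "zenith corp"]
same_first_word = lambda a, b: 1.0 if a.split()[0] == b.split()[0] else 0.0
print(len(list(candidate_pairs(names))))      # pairs compared after blocking
print(cluster(names, same_first_word, 0.5))   # resulting groups
```

For these four names, blocking on two characters compares 3 pairs rather than the 6 an all-pairs comparison would need. The trade-off is that names differing in their first characters are never compared, which is presumably why the claimed method corrects syntactical errors before this step.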
Specification