Method and Apparatus for Matching Misspellings Caused by Phonetic Variations
First Claim
1. A computer-implemented method for matching terms, comprising the steps of:
a. receiving at a processor in communication with a computer-readable medium a first term and a second term, wherein each of the first term and second term comprises a character string stored on the computer-readable medium;
b. tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant or consonant placeholder, and at least one vowel or vowel placeholder;
c. comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the characters in each of the first tokens in the first tokenized set to the characters in the corresponding second token from the second tokenized set to determine if a match exists between the first term and the second term, wherein said comparison step is performed using a first compiled language library (CLL) comprising a set of consonants, a set of vowels, and a plurality of consonant equivalencies and vowel equivalencies whereby a match exists if the characters in each of the first tokens in the first tokenized set are identical to the characters in the corresponding second token from the second tokenized set or if the first tokens in the first tokenized set are equivalent to the characters in the corresponding second token from the second tokenized set; and
e. outputting from the processor an indicator of whether a match has occurred.
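The steps of the claim above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `PLACEHOLDER` marker, the toy consonant and vowel sets, the sample equivalencies, and the function names `tokenize`, `parts_equivalent`, and `terms_match` are all assumptions made for illustration. Each token is modeled as a pair holding a consonant group and a vowel group, with a placeholder filling any empty slot.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (consonant group, vowel group)

PLACEHOLDER = "_"  # hypothetical marker for an empty consonant or vowel slot

# Toy stand-in for a compiled language library (CLL); contents are illustrative.
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")
VOWELS = set("aeiou")
EQUIVALENCIES = {
    frozenset(("ph", "f")),
    frozenset(("c", "k")),
    frozenset(("ee", "ea")),
}


def tokenize(term: str) -> List[Token]:
    """Split a term into tokens of one consonant run followed by one vowel run."""
    s = term.lower()
    tokens: List[Token] = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] in CONSONANTS:
            j += 1
        k = j
        while k < len(s) and s[k] in VOWELS:
            k += 1
        if k == i:  # character is neither a known consonant nor a known vowel
            raise ValueError(f"unsupported character {s[i]!r} in {term!r}")
        tokens.append((s[i:j] or PLACEHOLDER, s[j:k] or PLACEHOLDER))
        i = k
    return tokens


def parts_equivalent(a: str, b: str) -> bool:
    """Two groups match if they are identical or listed as equivalent."""
    return a == b or frozenset((a, b)) in EQUIVALENCIES


def terms_match(first: str, second: str) -> bool:
    """Return the match indicator for two terms (steps b through e)."""
    first_tokens = tokenize(first)
    second_tokens = tokenize(second)
    if len(first_tokens) != len(second_tokens):  # step c: token counts differ
        return False
    return all(  # step d: compare corresponding tokens
        parts_equivalent(c1, c2) and parts_equivalent(v1, v2)
        for (c1, v1), (c2, v2) in zip(first_tokens, second_tokens)
    )
```

Under these assumptions, `terms_match("phone", "fone")` is true because both terms yield two tokens and "ph" is listed as equivalent to "f", while `terms_match("phone", "phoneme")` is false because the token counts differ.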
Abstract
A method and apparatus for matching equivalent words across languages takes advantage of a set of rules that are built from a user-defined language specification (UDLS), which may be open source and customizable by a language expert. The UDLS is used to build a compiled language library (CLL) that includes a list of consonants, a list of vowels, and rules defining phoneme equivalencies across two languages. The CLL is used to match equivalent words by both two-set and three-set matching, not only to increase the number of true matches (i.e., overall accuracy) but also to improve recognition of variations in a manner that is not language specific.
27 Claims
3. The computer-implemented method of claim 3, wherein the two-step tokenization step comprises the step of combining groups of zero or one consonants into a consonant set and combining groups of zero or one vowels into a vowel set to create a two-set token.
11. A computer-implemented method for building a compiled language library for matching equivalent words, comprising the steps of:
a. receiving at a processor from a computer-readable medium in communication with the processor a user-defined language specification (UDLS), wherein the UDLS comprises a plurality of consonants, a plurality of vowels, and a plurality of sets of matched terms, wherein each set of matched terms comprises a first term and a second term that differ in spelling but are equivalent, and wherein each of the first and second term comprise a character string;
b. for each of the plurality of sets of matched terms, tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant from the UDLS or consonant placeholder, and at least one vowel from the UDLS or vowel placeholder;
c. comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the consonants or consonant placeholders and vowels or vowel placeholders in each of the first tokens in the first tokenized set to the consonants or consonant placeholders and vowels or vowel placeholders in the corresponding second token from the second tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the consonants or consonant placeholders and vowels or vowel placeholders, and writing the rule to a first compiled language library (CLL); and
e. storing the first CLL on the computer-storage medium.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
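The library-building steps of claim 11 can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout of the CLL, the JSON storage format, the `PLACEHOLDER` marker, and the names `tokenize`, `build_cll`, and `save_cll` are hypothetical and do not come from the patent text. Pairs whose token counts differ are skipped here, since the claim only describes rule creation for the equal-count case.

```python
import json
from typing import Dict, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (consonant group, vowel group)
PLACEHOLDER = "_"  # hypothetical marker for an empty consonant or vowel slot


def tokenize(term: str, consonants: Set[str], vowels: Set[str]) -> List[Token]:
    """Split a term into tokens of one consonant run followed by one vowel run."""
    s = term.lower()
    tokens: List[Token] = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] in consonants:
            j += 1
        k = j
        while k < len(s) and s[k] in vowels:
            k += 1
        if k == i:  # character missing from the language specification
            raise ValueError(f"unsupported character {s[i]!r} in {term!r}")
        tokens.append((s[i:j] or PLACEHOLDER, s[j:k] or PLACEHOLDER))
        i = k
    return tokens


def build_cll(
    consonants: Set[str],
    vowels: Set[str],
    matched_terms: Iterable[Tuple[str, str]],
) -> Dict[str, list]:
    """Derive equivalency rules from pairs of terms known to be equivalent."""
    rules = set()
    for first, second in matched_terms:
        first_tokens = tokenize(first, consonants, vowels)
        second_tokens = tokenize(second, consonants, vowels)
        if len(first_tokens) != len(second_tokens):
            continue  # step c failed; no rule is derived from this pair
        for (c1, v1), (c2, v2) in zip(first_tokens, second_tokens):
            if c1 != c2:  # differing consonant groups become a rule
                rules.add(frozenset((c1, c2)))
            if v1 != v2:  # differing vowel groups become a rule
                rules.add(frozenset((v1, v2)))
    return {
        "consonants": sorted(consonants),
        "vowels": sorted(vowels),
        "equivalencies": sorted(sorted(rule) for rule in rules),
    }


def save_cll(cll: Dict[str, list], path: str) -> None:
    """Store the library on disk (step e); JSON is an assumed format."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cll, handle, indent=2)
```

For example, passing the matched pairs `("phone", "fone")` and `("cat", "kat")` to `build_cll` would record "f"/"ph" and "c"/"k" as equivalencies in this sketch.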
20. A computerized system for matching equivalent words with different spellings, comprising:
a. a compiled language library (CLL) stored on a computer-readable medium, wherein the CLL comprises:
i. a plurality of consonant phonemes;
ii. a plurality of vowel phonemes; and
iii. a plurality of matched equivalencies, wherein each of the plurality of matched letter equivalencies comprises a first phoneme and a second phoneme that are spelled differently but are equivalent;
b. a processor in electronic communication with the computer-readable medium on which the CLL is stored;
c. a random access memory (RAM) in electronic communication with the processor; and
d. a computer program product stored on the computer-readable medium, comprising instructions that, when read into the RAM and executed on the processor, cause the processor to:
i. receive a first term and a second term, wherein each of the first term and second term comprises at least one character;
ii. tokenize the first term and the second term to create a first tokenized set comprising a plurality of first tokens comprising at least one phoneme and a second tokenized set comprising a plurality of second tokens comprising at least one phoneme;
iii. compare the first tokenized set to the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
iv. determine if the first tokenized set comprises an equal number of tokens as the second tokenized set, and if so compare the first tokens to the second tokens to determine if the tokens are identical or equivalent, wherein two tokens are equivalent if they are matched in a matched equivalency in the CLL; and
v. output from the processor an indicator if a match has occurred.
Dependent claims: 21, 22, 23, 24, 25, 26, 27
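One possible in-memory representation of the CLL recited in element (a) of claim 20 is sketched below. The class name `CompiledLanguageLibrary`, its field names, the JSON layout, and the `equivalent` helper are assumptions for illustration; the claim itself does not prescribe a storage format or an API.

```python
import json
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CompiledLanguageLibrary:
    """Hypothetical container for the three parts of a CLL."""

    consonant_phonemes: FrozenSet[str]
    vowel_phonemes: FrozenSet[str]
    equivalencies: FrozenSet[FrozenSet[str]]  # unordered phoneme pairs

    @classmethod
    def from_json(cls, text: str) -> "CompiledLanguageLibrary":
        """Load a library from JSON text read off the storage medium."""
        raw = json.loads(text)
        return cls(
            consonant_phonemes=frozenset(raw["consonants"]),
            vowel_phonemes=frozenset(raw["vowels"]),
            equivalencies=frozenset(frozenset(p) for p in raw["equivalencies"]),
        )

    def to_json(self) -> str:
        """Serialize the library with a stable ordering."""
        return json.dumps(
            {
                "consonants": sorted(self.consonant_phonemes),
                "vowels": sorted(self.vowel_phonemes),
                "equivalencies": sorted(sorted(p) for p in self.equivalencies),
            }
        )

    def equivalent(self, first: str, second: str) -> bool:
        """Phonemes match if identical or paired in a matched equivalency."""
        return first == second or frozenset((first, second)) in self.equivalencies
```

Storing each equivalency as an unordered pair keeps the lookup symmetric, so the order in which two phonemes are compared does not affect the result.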
Specification