Method and apparatus for matching misspellings caused by phonetic variations
Abstract
A method and apparatus for matching equivalent words across languages takes advantage of a set of rules that are built from a user-defined language specification (UDLS), which may be open source and customizable by a language expert. The UDLS is used to build a compiled language library (CLL) that includes a list of consonants, a list of vowels, and rules defining phoneme equivalencies across two languages. The CLL is used to match equivalent words by both two-set and three-set matching to not only increase the number of true matches (i.e., overall accuracy), but also improve recognition of variations in a manner that is not language specific.
37 Citations
41 Claims
1. A computer-implemented method for identifying phonetic equivalents between words spoken in a natural source language by native speakers of the source language and words spoken by non-native speakers of the source language who natively speak a common natural second language, comprising the steps of:
a. receiving at a processor in communication with a computer-readable medium a first term and a second term, wherein each of the first term and second term comprises a character string stored on the computer-readable medium and at least one of the first and second terms is derived from a non-native speaker of the source language;
b. tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant or consonant placeholder, and at least one vowel or vowel placeholder;
c. after the tokenizing step, comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the characters in each of the first tokens in the first tokenized set to the characters in the corresponding second token from the second tokenized set to determine if a match exists between the first term and the second term, wherein said comparison step is performed using a first compiled language library (CLL) comprising a set of equivalent consonant pairs and a set of equivalent vowel pairs, wherein an equivalence exists if the characters in each of the first tokens in the first tokenized set are identical to the characters in the corresponding second token from the second tokenized set or if the first tokens in the first tokenized set are equivalent to the characters in the corresponding second token from the second tokenized set, wherein consonant equivalencies and vowel equivalencies are found based on a phonetically identical pronunciation of such consonants and vowels by the non-native speakers of the source language who natively speak the common second language; and
e. outputting from the processor an indicator of whether the first and second terms are phonetic equivalents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

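The matching steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the vowel list, the `_` placeholder symbol, the function names, and the toy equivalence pairs in `CLL` are all assumptions made for illustration, and any non-vowel character is simply treated as a consonant. A mismatch in token counts is reported here as "no match", which simplifies what the claims describe.

```python
VOWELS = set("aeiou")   # assumption: toy vowel list, not taken from the patent
PLACEHOLDER = "_"       # assumption: symbol standing in for a missing group

def tokenize_two_set(term):
    """Split a term into (consonant group, vowel group) tokens."""
    term = term.lower()
    tokens, i, n = [], 0, len(term)
    while i < n:
        j = i
        while j < n and term[j] not in VOWELS:   # consume the consonant run
            j += 1
        k = j
        while k < n and term[k] in VOWELS:       # consume the vowel run
            k += 1
        tokens.append((term[i:j] or PLACEHOLDER, term[j:k] or PLACEHOLDER))
        i = k
    return tokens

def parts_match(a, b, pairs):
    """Two groups match if identical or listed as an equivalent pair."""
    return a == b or frozenset((a, b)) in pairs

def phonetic_match(first, second, cll):
    """Return True if both terms tokenize to equal-length, equivalent token lists."""
    t1, t2 = tokenize_two_set(first), tokenize_two_set(second)
    if len(t1) != len(t2):
        return False  # simplification: unequal token counts are treated as no match
    return all(
        parts_match(c1, c2, cll["consonants"]) and parts_match(v1, v2, cll["vowels"])
        for (c1, v1), (c2, v2) in zip(t1, t2)
    )

# Hypothetical compiled language library with invented equivalence pairs.
CLL = {
    "consonants": {frozenset(("m", "mm"))},
    "vowels": {frozenset(("o", "u")), frozenset(("e", "a"))},
}

print(tokenize_two_set("mohamed"))                # [('m', 'o'), ('h', 'a'), ('m', 'e'), ('d', '_')]
print(phonetic_match("mohamed", "muhammad", CLL)) # True
print(phonetic_match("mohamed", "mahmud", CLL))   # False (4 tokens vs 3)
```

Storing each pair as a `frozenset` makes the lookup symmetric, so the order in which the two terms are supplied does not matter.
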
11. A computer-implemented method for building a compiled language library for establishing the equivalence of words from a source language as spoken by non-native speakers of the source language who natively speak a common second language, comprising the steps of:
a. receiving at a processor from a computer-readable medium in communication with the processor a user-defined language specification (UDLS), wherein the UDLS comprises a plurality of consonants, a plurality of vowels, and a plurality of sets of equivalent natural language terms, wherein each set of equivalent terms comprises a first term and a second term that differ in spelling but are phonetically equivalent with respect to native speakers of the second language, and wherein each of the first and second term comprise a character string;
b. for each of the plurality of sets of equivalent terms, tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant from the UDLS or consonant placeholder, and at least one vowel from the UDLS or vowel placeholder;
c. after the tokenizing step, comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the consonants or consonant placeholders and vowels or vowel placeholders in each of the first tokens in the first tokenized set to the consonants or consonant placeholders and vowels or vowel placeholders in the corresponding second token from the second tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the consonants or consonant placeholders and vowels or vowel placeholders, and writing the rule to a first compiled language library (CLL), wherein the rule comprises at least one of a pair of consonants or a pair of vowels; and
e. storing the first CLL on the computer-storage medium.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19

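The library-building steps of claim 11 can be sketched in the same spirit. This is only an illustration under stated assumptions: the dictionary layout of the UDLS, the `_` placeholder, the function names, and the example term pairs are hypothetical, and the resulting CLL is kept in memory rather than written to a storage medium. Characters not listed as vowels are treated as consonants.

```python
PLACEHOLDER = "_"  # assumption: symbol standing in for a missing group

def tokenize_two_set(term, vowels):
    """Split a term into (consonant group, vowel group) tokens."""
    term = term.lower()
    tokens, i, n = [], 0, len(term)
    while i < n:
        j = i
        while j < n and term[j] not in vowels:   # consonant run
            j += 1
        k = j
        while k < n and term[k] in vowels:       # vowel run
            k += 1
        tokens.append((term[i:j] or PLACEHOLDER, term[j:k] or PLACEHOLDER))
        i = k
    return tokens

def build_cll(udls):
    """Derive consonant and vowel equivalence rules from pairs of equivalent terms."""
    cll = {"consonants": set(), "vowels": set()}
    for first, second in udls["equivalent_terms"]:
        t1 = tokenize_two_set(first, udls["vowels"])
        t2 = tokenize_two_set(second, udls["vowels"])
        if len(t1) != len(t2):
            continue  # this sketch only handles pairs with equal token counts
        for (c1, v1), (c2, v2) in zip(t1, t2):
            if c1 != c2:
                cll["consonants"].add(frozenset((c1, c2)))  # new consonant rule
            if v1 != v2:
                cll["vowels"].add(frozenset((v1, v2)))      # new vowel rule
    return cll

# Hypothetical UDLS: the layout and the term pairs are invented for illustration.
udls = {
    "vowels": set("aeiou"),
    "equivalent_terms": [("mohamed", "muhammad"), ("mohamed", "mahmud")],
}
cll = build_cll(udls)
print(cll["consonants"])  # {frozenset({'m', 'mm'})}
print(cll["vowels"])      # contains frozenset({'o', 'u'}) and frozenset({'e', 'a'})
```

The second term pair yields no rules because its two-set token counts differ (4 versus 3).
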
20. A computerized system for establishing the equivalence of phonetically equivalent words of a first language with different spellings, comprising:
a. a compiled language library (CLL) stored on a computer-readable medium, wherein the CLL comprises;
   i. a plurality of consonant phonemes;
   ii. a plurality of vowel phonemes; and
   iii. a plurality of equivalent letter pairs, wherein each of the plurality of equivalent letter pairs comprises a first phoneme and a second phoneme that are spelled differently but are phonetically identical with respect to non-native speakers of the first language but who natively speak a common second language;
b. a processor in electronic communication with the computer-readable medium on which the CLL is stored;
c. a random access memory (RAM) in electronic communication with the processor; and
d. a computer program product stored on the computer-readable medium, comprising instructions that, when read into the RAM and executed on the processor, cause the processor to;
   i. receive a first term and a second term, wherein each of the first term and second term comprises at least one character;
   ii. tokenize the first term and the second term to create a first tokenized set comprising a plurality of first tokens comprising at least one phoneme and a second tokenized set comprising a plurality of second tokens comprising at least one phoneme;
   iii. after the process tokenizes the first and second term, compare the first tokenized set to the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
   iv. determine if the first tokenized set comprises an equal number of tokens as the second tokenized set, and if so compare the first tokens to the second tokens to determine if the tokens are identical or phonetically equivalent, wherein two tokens are phonetically equivalent if they are equivalent in the CLL; and
   v. output from the processor an indicator if the first term and the second term are phonetically equivalent.
Dependent claims: 21, 22, 23, 24, 25, 26, 27

28. A method for real-time verification of a potential customer's identity utilizing a microprocessor, comprising the steps of:
a. receiving at the microprocessor across a network a first character string and a second character string, wherein each of the first character string and second character string is a text representation of a proper noun represented in a first language;
b. in real time, tokenizing at the microprocessor the first character string and the second character string to create a first tokenized set comprising a plurality of first tokens from the first character string and a second tokenized set comprising a plurality of second tokens from the second character string, wherein each of the first tokens and second tokens comprises at least one consonant or consonant placeholder, and at least one vowel or vowel placeholder;
c. in real time, analyzing at the microprocessor each first token from the first tokenized set and a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, analyzing the characters in each of the first tokens in the first tokenized set and the characters in the corresponding second token from the second tokenized set to determine in real time if a match exists between the first character string and the second character string, wherein said analyzing step is performed using a first compiled language library (CLL) comprising a set of consonants, a set of vowels, and a plurality of consonant pairs and vowel pairs, wherein the consonant pairs and vowel pairs are pairs of letters that represent an equivalent sound when spoken or heard by a person whose native language is a second language different from the first language, and wherein a correspondence exists if the characters in each of the first tokens in the first tokenized set are identical to the characters in the corresponding second token from the second tokenized set or if the first tokens in the first tokenized set are phonetically equivalent to the characters in the corresponding second token from the second tokenized set; and
e. if a correspondence exists in step (d) above, outputting in real time across the network a result from the microprocessor indicating that the potential customer's identity is verified, and if a correspondence does not exist in step (d) above, outputting across the network in real time a result from the microprocessor indicating that the potential customer's identity is not verified.
Dependent claims: 29, 30, 31, 32, 33, 34, 35, 36, 37, 38

39. A computer-implemented method for building a compiled language library for establishing the equivalence of words, comprising the steps of:
a. receiving at a processor from a computer-readable medium in communication with the processor a user-defined language specification (UDLS), wherein the UDLS comprises a plurality of consonants, a plurality of vowels, and a plurality of sets of equivalent terms, wherein each set of equivalent terms comprises a first term and a second term that differ in spelling but are phonetically equivalent, and wherein each of the first and second term comprise a character string;
b. for each of the plurality of sets of equivalent terms, tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant from the UDLS or consonant placeholder, and at least one vowel from the UDLS or vowel placeholder, wherein the tokenizing step comprises two-set tokenization comprising the step of combining groups of zero or one consonants into a consonant set and combining groups of zero or one vowels into a vowel set to create a two-set token;
c. comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the consonants or consonant placeholders and vowels or vowel placeholders in each of the first tokens in the first tokenized set to the consonants or consonant placeholders and vowels or vowel placeholders in the corresponding second token from the second tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the consonants or consonant placeholders and vowels or vowel placeholders, and writing the rule to a first compiled language library (CLL);
e. storing the first CLL on the computer-storage medium;
f. if the first tokenized set comprises a different number of tokens as the second tokenized set, further comprising the step of performing three-step tokenization, wherein the step of performing three-step tokenization comprises the steps of;
   i. tokenizing at the processor the first term and the second term to create a third tokenized set comprising a plurality of third tokens from the first term and a fourth tokenized set comprising a plurality of fourth tokens from the second term, wherein each of the third and fourth tokens comprises at least one leading consonant or leading consonant placeholder, at least one vowel or vowel placeholder, and at least one trailing consonant or trailing consonant placeholder, further wherein the three-step tokenization step comprises the step of combining groups of zero or one leading consonants into a leading consonant set, combining groups of zero or one vowels into a vowel set, and combining groups of zero or one trailing consonants into an optional trailing consonant to create a three-set token, and further wherein a first symbol is used to represent a missing leading consonant in each three-set token comprising zero leading consonants;
   ii. comparing at the processor each third token from the third tokenized set with a corresponding fourth token from the fourth tokenized set to determine if the third tokenized set comprises an equal number of tokens as the fourth tokenized set;
   iii. if the third tokenized set comprises an equal number of tokens as the fourth tokenized set, comparing the characters in each of the third tokens in the third tokenized set to the characters in the corresponding fourth token from the fourth tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the leading consonants or leading consonant placeholders, vowels or vowel placeholders, and trailing consonants or trailing consonant placeholders, and writing the rule to a second compiled language library (CLL); and
g. storing the second CLL on the computer-storage medium.
Dependent claims: 40, 41

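The three-set fallback in step (f) of claim 39 can be sketched as follows. The claim does not say how a consonant cluster between two vowel groups is divided into trailing and leading consonants, so the rule used here (the last consonant of the cluster leads the next token, the rest trail the current one) is purely an assumption, as are the vowel list, the `_` placeholder, and the function names.

```python
import re

VOWELS = "aeiou"     # assumption: toy vowel list
PLACEHOLDER = "_"    # assumption: symbol for a missing leading/vowel/trailing group

def tokenize_three_set(term):
    """Split a term into (leading consonants, vowels, trailing consonants) tokens."""
    # Alternating maximal runs of vowels and non-vowels.
    runs = re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", term.lower())
    tokens, lead, i = [], PLACEHOLDER, 0
    if runs and runs[0][0] not in VOWELS:
        lead, i = runs[0], 1
        if len(runs) == 1:                       # term without any vowel
            return [(lead, PLACEHOLDER, PLACEHOLDER)]
    while i < len(runs):
        vowel = runs[i]
        trail, next_lead = PLACEHOLDER, PLACEHOLDER
        if i + 1 < len(runs):
            cons = runs[i + 1]
            if i + 2 < len(runs):                # another vowel group follows
                trail = cons[:-1] or PLACEHOLDER # assumed split of the cluster
                next_lead = cons[-1]
            else:                                # word-final consonants trail
                trail = cons
        tokens.append((lead, vowel, trail))
        lead = next_lead
        i += 2
    return tokens

def three_set_rules(first, second):
    """Collect equivalence rules from two terms whose three-set token counts agree."""
    t3, t4 = tokenize_three_set(first), tokenize_three_set(second)
    if len(t3) != len(t4):
        return set()
    rules = set()
    for a, b in zip(t3, t4):
        for slot, x, y in zip(("lead", "vowel", "trail"), a, b):
            if x != y:
                rules.add((slot, frozenset((x, y))))
    return rules

print(tokenize_three_set("sara"))       # [('s', 'a', '_'), ('r', 'a', '_')]
print(tokenize_three_set("sarah"))      # [('s', 'a', '_'), ('r', 'a', 'h')]
print(three_set_rules("sara", "sarah")) # {('trail', frozenset({'_', 'h'}))}
```

Under a simple two-set split, "sara" and "sarah" would produce different token counts (the final "h" forms its own token), which is the situation in which step (f) falls back to three-set tokens.
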
Specification