Method and Apparatus for Matching Misspellings Caused by Phonetic Variations

US 20150066474A1
Filed: 07/17/2014
Published: 03/05/2015
Est. Priority Date: 09/05/2013
Status: Active Grant


3 Assignments

0 Petitions

Abstract

A method and apparatus for matching equivalent words across languages takes advantage of a set of rules that are built from a user-defined language specification (UDLS), which may be open source and customizable by a language expert. The UDLS is used to build a compiled language library (CLL) that includes a list of consonants, a list of vowels, and rules defining phoneme equivalencies across two languages. The CLL is used to match equivalent words by both two-set and three-set matching to not only increase the number of true matches (i.e., overall accuracy), but also improve recognition of variations in a manner that is not language specific.
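
As a rough illustration of the structure the abstract describes, a CLL could be modeled as a small immutable record holding the consonant list, the vowel list, and the equivalency rules. The class name, field names, and sample rules below are hypothetical and are not taken from the patent; this is a minimal Python sketch under those assumptions, not the patented implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledLanguageLibrary:
    """Hypothetical in-memory shape for a CLL (all names are assumptions)."""
    consonants: frozenset
    vowels: frozenset
    # Each rule is an unordered pair of phonemes treated as equivalent.
    consonant_equivalencies: frozenset = frozenset()
    vowel_equivalencies: frozenset = frozenset()

    def equivalent(self, a: str, b: str) -> bool:
        """True if a and b are identical or paired by a rule of either kind."""
        if a == b:
            return True
        pair = frozenset((a, b))
        return (pair in self.consonant_equivalencies
                or pair in self.vowel_equivalencies)


# Illustrative data only; a real UDLS would be authored by a language expert.
cll = CompiledLanguageLibrary(
    consonants=frozenset("bcdfghjklmnpqrstvwxyz"),
    vowels=frozenset("aeiou"),
    consonant_equivalencies=frozenset(
        {frozenset(("k", "g")), frozenset(("p", "b"))}
    ),
    vowel_equivalencies=frozenset({frozenset(("u", "o"))}),
)

print(cll.equivalent("k", "g"))  # True: paired by a consonant rule
print(cll.equivalent("k", "p"))  # False: no rule links them
```

Storing each rule as an unordered pair makes lookups symmetric, so "k" matches "g" in either argument order.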

36 Citations


27 Claims

1. A computer-implemented method for matching terms, comprising the steps of:
- a. receiving at a processor in communication with a computer-readable medium a first term and a second term, wherein each of the first term and second term comprises a character string stored on the computer-readable medium;
  
  b. tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant or consonant placeholder, and at least one vowel or vowel placeholder;
  
  c. comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
  
  d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the characters in each of the first tokens in the first tokenized set to the characters in the corresponding second token from the second tokenized set to determine if a match exists between the first term and the second term, wherein said comparison step is performed using a first compiled language library (CLL) comprising a set of consonants, a set of vowels, and a plurality of consonant equivalencies and vowel equivalencies whereby a match exists if the characters in each of the first tokens in the first tokenized set are identical to the characters in the corresponding second token from the second tokenized set or if the first tokens in the first tokenized set are equivalent to the characters in the corresponding second token from the second tokenized set; and
  
  e. outputting from the processor an indicator of whether a match has occurred.
  - 2. The computer-implemented method of claim 1, wherein the tokenizing step comprises a two-set tokenization step.
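
Steps (b) through (e) of claim 1 can be sketched as follows. This is one possible reading of the claim language, not the patented implementation: the placeholder symbols, the single-character treatment of consonants and vowels, the greedy left-to-right scan, and every function name are assumptions made for illustration. The sketch simply reports no match when the token counts differ.

```python
CONSONANT_PLACEHOLDER = "_"  # hypothetical symbol for a missing consonant
VOWEL_PLACEHOLDER = "-"      # hypothetical symbol for a missing vowel
VOWELS = frozenset("aeiou")  # assumed vowel set; other characters count as consonants


def tokenize_two_set(term):
    """Split a term into (consonant, vowel) tokens, zero or one of each."""
    term = term.lower()
    tokens = []
    i = 0
    while i < len(term):
        consonant, vowel = CONSONANT_PLACEHOLDER, VOWEL_PLACEHOLDER
        if term[i] not in VOWELS:
            consonant = term[i]
            i += 1
        if i < len(term) and term[i] in VOWELS:
            vowel = term[i]
            i += 1
        tokens.append((consonant, vowel))
    return tokens


def equivalent(a, b, rules):
    """Identical characters match; otherwise look for an equivalency rule."""
    return a == b or frozenset((a, b)) in rules


def terms_match(first, second, consonant_rules, vowel_rules):
    first_tokens = tokenize_two_set(first)
    second_tokens = tokenize_two_set(second)
    # Step (c): the token counts must agree before tokens are compared.
    if len(first_tokens) != len(second_tokens):
        return False
    # Step (d): every token pair must be identical or equivalent.
    return all(
        equivalent(c1, c2, consonant_rules) and equivalent(v1, v2, vowel_rules)
        for (c1, v1), (c2, v2) in zip(first_tokens, second_tokens)
    )


consonant_rules = {frozenset(("k", "g"))}  # illustrative rule, not from the patent
vowel_rules = set()

print(tokenize_two_set("kim"))                                  # [('k', 'i'), ('m', '-')]
print(terms_match("kim", "gim", consonant_rules, vowel_rules))  # True
print(terms_match("kim", "kin", consonant_rules, vowel_rules))  # False
```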

3. The computer-implemented method of claim 2, wherein the two-set tokenization step comprises the step of combining groups of zero or one consonants into a consonant set and combining groups of zero or one vowels into a vowel set to create a two-set token.
  - 4. The computer-implemented method of claim 3, wherein a first placeholder symbol is used to represent a missing consonant in each two-set token comprising zero consonants.
  - 5. The computer-implemented method of claim 4, wherein a second placeholder symbol is used to represent a missing vowel in each two-set token comprising zero vowels.
  - 6. The computer-implemented method of claim 5, wherein if the first tokenized set comprises a different number of tokens than the second tokenized set, further comprising the step of performing three-step tokenization.
  - 7. The computer-implemented method of claim 6, wherein the step of performing three-step tokenization comprises the steps of:
    - a. tokenizing at the processor the first term and the second term to create a third tokenized set comprising a plurality of third tokens from the first term and a fourth tokenized set comprising a plurality of fourth tokens from the second term, wherein each of the third and fourth tokens comprises at least one leading consonant or leading consonant placeholder, at least one vowel or vowel placeholder, and at least one trailing consonant or trailing consonant placeholder;
      
      b. comparing at the processor each third token from the third tokenized set with a corresponding fourth token from the fourth tokenized set to determine if the third tokenized set comprises an equal number of tokens as the fourth tokenized set;
      
      c. if the third tokenized set comprises an equal number of tokens as the fourth tokenized set, comparing the characters in each of the third tokens in the third tokenized set to the characters in the corresponding fourth token from the fourth tokenized set to determine if a match exists between the first term and the second term, wherein said comparison step is performed using a second compiled language library (CLL) comprising a set of consonants, a set of vowels, and a plurality of leading consonant equivalencies, vowel equivalencies, and trailing consonant equivalencies, whereby a match exists if the characters in each of the third tokens in the third tokenized set are identical to the characters in the corresponding fourth token from the fourth tokenized set or if the third tokens in the third tokenized set are equivalent to the characters in the corresponding fourth token from the fourth tokenized set; and
      
      d. outputting from the processor an indicator of whether a match has occurred.
  - 8. The computer-implemented method of claim 7, further comprising the steps of calculating a similarity measure between the first term and the second term, and outputting from the processor the similarity measure.
  - 9. The computer-implemented method of claim 8, wherein the calculation of the similarity measure comprises a Levenshtein distance calculation.
  - 10. The computer-implemented method of claim 7, wherein the first CLL and second CLL are stored in a single file on the computer-readable medium.
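
The three-part tokens of claim 7 and the Levenshtein-based similarity measure of claims 8 and 9 might look like the sketch below. The greedy scan, the three placeholder symbols, and the normalization of the distance into a 0 to 1 score are all assumptions made for illustration; the claims only state that the similarity measure comprises a Levenshtein distance calculation.

```python
LEADING_PLACEHOLDER = "_"   # hypothetical: missing leading consonant
VOWEL_PLACEHOLDER = "-"     # hypothetical: missing vowel
TRAILING_PLACEHOLDER = "#"  # hypothetical: missing trailing consonant
VOWELS = frozenset("aeiou") # assumed vowel set


def tokenize_three_set(term):
    """Split a term into (leading consonant, vowel, trailing consonant) tokens."""
    term = term.lower()
    tokens = []
    i, n = 0, len(term)
    while i < n:
        lead, vowel, trail = (LEADING_PLACEHOLDER, VOWEL_PLACEHOLDER,
                              TRAILING_PLACEHOLDER)
        if term[i] not in VOWELS:
            lead = term[i]
            i += 1
        if i < n and term[i] in VOWELS:
            vowel = term[i]
            i += 1
        if i < n and term[i] not in VOWELS:
            trail = term[i]
            i += 1
        tokens.append((lead, vowel, trail))
    return tokens


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical terms, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(tokenize_three_set("hanna"))   # [('h', 'a', 'n'), ('n', 'a', '#')]
print(tokenize_three_set("hana"))    # [('h', 'a', 'n'), ('_', 'a', '#')]
print(levenshtein("hanna", "hana"))  # 1
```

Under this reading, "hanna" and "hana" produce different token counts with consonant-vowel tokens but the same count with three-part tokens, which is the situation the fallback in claim 6 addresses.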

11. A computer-implemented method for building a compiled language library for matching equivalent words, comprising the steps of:
- a. receiving at a processor from a computer-readable medium in communication with the processor a user-defined language specification (UDLS), wherein the UDLS comprises a plurality of consonants, a plurality of vowels, and a plurality of sets of matched terms, wherein each set of matched terms comprises a first term and a second term that differ in spelling but are equivalent, and wherein each of the first term and second term comprises a character string;
  
  b. for each of the plurality of sets of matched terms, tokenizing at the processor the first term and the second term to create a first tokenized set comprising a plurality of first tokens from the first term and a second tokenized set comprising a plurality of second tokens from the second term, wherein each of the first tokens and second tokens comprises at least one consonant from the UDLS or consonant placeholder, and at least one vowel from the UDLS or vowel placeholder;
  
  c. comparing at the processor each first token from the first tokenized set with a corresponding second token from the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
  
  d. if the first tokenized set comprises an equal number of tokens as the second tokenized set, comparing the consonants or consonant placeholders and vowels or vowel placeholders in each of the first tokens in the first tokenized set to the consonants or consonant placeholders and vowels or vowel placeholders in the corresponding second token from the second tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the consonants or consonant placeholders and vowels or vowel placeholders, and writing the rule to a first compiled language library (CLL); and
  
  e. storing the first CLL on the computer-readable medium.
  - 12. The computer-implemented method of claim 11, wherein the tokenizing step comprises two-set tokenization.
  - 13. The computer-implemented method of claim 12, wherein the two-set tokenization step comprises the step of combining groups of zero or one consonants into a consonant set and combining groups of zero or one vowels into a vowel set to create a two-set token.
  - 14. The computer-implemented method of claim 13, wherein if the first tokenized set comprises a different number of tokens than the second tokenized set, further comprising the step of performing three-step tokenization.
  - 15. The computer-implemented method of claim 14, wherein the step of performing three-step tokenization comprises the steps of:
    - a. tokenizing at the processor the first term and the second term to create a third tokenized set comprising a plurality of third tokens from the first term and a fourth tokenized set comprising a plurality of fourth tokens from the second term, wherein each of the third and fourth tokens comprises at least one leading consonant or leading consonant placeholder, at least one vowel or vowel placeholder, and at least one trailing consonant or trailing consonant placeholder;
      
      b. comparing at the processor each third token from the third tokenized set with a corresponding fourth token from the fourth tokenized set to determine if the third tokenized set comprises an equal number of tokens as the fourth tokenized set;
      
      c. if the third tokenized set comprises an equal number of tokens as the fourth tokenized set, comparing the characters in each of the third tokens in the third tokenized set to the characters in the corresponding fourth token from the fourth tokenized set to determine if the characters are identical, and if the characters are not identical then creating a rule indicating the equivalency of the leading consonants or leading consonant placeholders, vowels or vowel placeholders, and trailing consonants or trailing consonant placeholders, and writing the rule to a second compiled language library (CLL); and
      
      d. storing the second CLL on the computer-readable medium.
  - 16. The computer-implemented method of claim 15, wherein the three-step tokenization step comprises the step of combining groups of zero or one leading consonants into a leading consonant set, combining groups of zero or one vowels into a vowel set, and combining groups of zero or one trailing consonants into an optional trailing consonant to create a three-set token.
  - 17. The computer-implemented method of claim 16, wherein a first symbol is used to represent a missing leading consonant in each three-set token comprising zero leading consonants.
  - 18. The computer-implemented method of claim 17, wherein a second symbol is used to represent a missing vowel in each three-set token comprising zero vowels.
  - 19. The computer-implemented method of claim 18, wherein a third symbol is used to represent a missing optional trailing consonant in each three-set token comprising zero trailing consonants.
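
The rule-building loop of claim 11 can be sketched as below, using two-set tokens only. The dictionary layout of the UDLS and the CLL, the sample term pairs, the placeholder symbols, and the JSON serialization are hypothetical choices for illustration. Pairs whose token counts differ are simply skipped here, whereas claims 14 and 15 would retry them with three-part tokens.

```python
import json

CONSONANT_PLACEHOLDER = "_"  # hypothetical symbol for a missing consonant
VOWEL_PLACEHOLDER = "-"      # hypothetical symbol for a missing vowel


def tokenize_two_set(term, vowels):
    """Split a term into (consonant, vowel) tokens, zero or one of each."""
    term = term.lower()
    tokens, i = [], 0
    while i < len(term):
        consonant, vowel = CONSONANT_PLACEHOLDER, VOWEL_PLACEHOLDER
        if term[i] not in vowels:
            consonant = term[i]
            i += 1
        if i < len(term) and term[i] in vowels:
            vowel = term[i]
            i += 1
        tokens.append((consonant, vowel))
    return tokens


def build_cll(udls):
    """Derive equivalency rules from matched term pairs (steps b to d)."""
    vowels = set(udls["vowels"])
    consonant_rules, vowel_rules = set(), set()
    for first, second in udls["matched_terms"]:
        a = tokenize_two_set(first, vowels)
        b = tokenize_two_set(second, vowels)
        if len(a) != len(b):
            continue  # simplification: no three-set retry in this sketch
        for (c1, v1), (c2, v2) in zip(a, b):
            if c1 != c2:
                consonant_rules.add(tuple(sorted((c1, c2))))
            if v1 != v2:
                vowel_rules.add(tuple(sorted((v1, v2))))
    return {
        "consonants": sorted(udls["consonants"]),
        "vowels": sorted(vowels),
        "consonant_equivalencies": sorted(consonant_rules),
        "vowel_equivalencies": sorted(vowel_rules),
    }


# Illustrative UDLS; the term pairs are made up for this example.
udls = {
    "consonants": "bcdfghjklmnpqrstvwxyz",
    "vowels": "aeiou",
    "matched_terms": [("kim", "gim"), ("jung", "jong"), ("lee", "li")],
}

cll = build_cll(udls)
print(cll["consonant_equivalencies"])  # [('g', 'k')]
print(cll["vowel_equivalencies"])      # [('o', 'u')]

serialized = json.dumps(cll)  # step (e) could write this string to storage
```

The pair ("lee", "li") contributes no rule because it tokenizes to two tokens and one token respectively.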

20. A computerized system for matching equivalent words with different spellings, comprising:
- a. a compiled language library (CLL) stored on a computer-readable medium, wherein the CLL comprises:
  
  i. a plurality of consonant phonemes;
  
  ii. a plurality of vowel phonemes; and
  
  iii. a plurality of matched equivalencies, wherein each of the plurality of matched equivalencies comprises a first phoneme and a second phoneme that are spelled differently but are equivalent;
  
  b. a processor in electronic communication with the computer-readable medium on which the CLL is stored;
  
  c. a random access memory (RAM) in electronic communication with the processor; and
  
  d. a computer program product stored on the computer-readable medium, comprising instructions that, when read into the RAM and executed on the processor, cause the processor to:
  
  i. receive a first term and a second term, wherein each of the first term and second term comprises at least one character;
  
  ii. tokenize the first term and the second term to create a first tokenized set comprising a plurality of first tokens comprising at least one phoneme and a second tokenized set comprising a plurality of second tokens comprising at least one phoneme;
  
  iii. compare the first tokenized set to the second tokenized set to determine if the first tokenized set comprises an equal number of tokens as the second tokenized set;
  
  iv. determine if the first tokenized set comprises an equal number of tokens as the second tokenized set, and if so compare the first tokens to the second tokens to determine if the tokens are identical or equivalent, wherein two tokens are equivalent if they are matched in a matched equivalency in the CLL; and
  
  v. output from the processor an indicator if a match has occurred.
  - 21. The computerized system of claim 20, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to, when comparing the first tokens to the second tokens to determine if the tokens are identical or equivalent, compare a first phoneme in the first token to a second phoneme in the second token, determine if the first phoneme and second phoneme are identical, and if not read the matched equivalencies in the CLL to determine if the first phoneme and second phoneme are equivalent.
  - 22. The computerized system of claim 21, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to tokenize the first term and the second term by two-set tokenization.
  - 23. The computerized system of claim 22, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to tokenize the first term and the second term by combining groups of zero or one consonants into a consonant set and combining groups of zero or one vowels into a vowel set to create a two-set token.
  - 24. The computerized system of claim 23, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to tokenize the first term and the second term by inserting a first placeholder symbol to represent a missing consonant in each two-set token comprising zero consonants.
  - 25. The computerized system of claim 24, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to tokenize the first term and the second term by inserting a second placeholder symbol to represent a missing vowel in each two-set token comprising zero vowels.
  - 26. The computerized system of claim 25, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to, if the first tokenized set does not comprise an equal number of tokens as the second tokenized set, tokenize the first term and the second term by three-step tokenization.
  - 27. The computerized system of claim 26, wherein the computer program product stored on the computer-readable medium further comprises instructions that, when read into the RAM and executed on the processor, cause the processor to:
    - a. tokenize the first term and the second term to create a third tokenized set comprising a plurality of third tokens comprising at least one phoneme and a fourth tokenized set comprising a plurality of fourth tokens comprising at least one phoneme;
      
      b. compare the third tokenized set to the fourth tokenized set to determine if the third tokenized set comprises an equal number of tokens as the fourth tokenized set;
      
      c. determine if the third tokenized set comprises an equal number of tokens as the fourth tokenized set, and if so compare the third tokens to the fourth tokens to determine if the tokens are identical or equivalent; and
      
      d. output from the processor an indicator of whether a match has occurred or has not occurred.
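
The control flow of claims 20 through 27, where consonant-vowel tokens are tried first and three-part tokens are used only when the token counts differ, could be sketched as follows. A single pattern-driven tokenizer stands in for both tokenization modes, and one placeholder symbol is reused for every empty slot; both are simplifications, and the function names, per-slot rule lists, and sample rules are hypothetical rather than taken from the patent.

```python
PLACEHOLDER = "_"            # one symbol for every empty slot, for brevity
VOWELS = frozenset("aeiou")  # assumed vowel set


def tokenize(term, pattern):
    """Tokenize with a slot pattern: "CV" (two-set) or "CVC" (three-set)."""
    term = term.lower()
    tokens, i = [], 0
    while i < len(term):
        token = []
        for kind in pattern:
            # A slot is filled only if the next character is of its kind.
            if i < len(term) and (term[i] in VOWELS) == (kind == "V"):
                token.append(term[i])
                i += 1
            else:
                token.append(PLACEHOLDER)
        tokens.append(tuple(token))
    return tokens


def tokens_equivalent(a, b, slot_rules):
    """Compare two tokens slot by slot against that slot's equivalency set."""
    return all(
        x == y or frozenset((x, y)) in rules
        for x, y, rules in zip(a, b, slot_rules)
    )


def match(first, second, two_set_rules, three_set_rules):
    """Return True on a match; fall back to "CVC" only if "CV" counts differ."""
    for pattern, slot_rules in (("CV", two_set_rules), ("CVC", three_set_rules)):
        a, b = tokenize(first, pattern), tokenize(second, pattern)
        if len(a) == len(b):
            return all(tokens_equivalent(x, y, slot_rules) for x, y in zip(a, b))
    return False


# Illustrative rules: one set per slot, in slot order.
two_set_rules = [{frozenset(("k", "g"))}, set()]
three_set_rules = [{frozenset(("n", PLACEHOLDER))}, set(), set()]

print(match("kim", "gim", two_set_rules, three_set_rules))     # True (two-set)
print(match("hanna", "hana", two_set_rules, three_set_rules))  # True (three-set)
print(match("kim", "park", two_set_rules, three_set_rules))    # False
```

In the second call the two-set token counts are 3 and 2, so the comparison falls through to three-part tokens, where a rule pairing a leading "n" with a missing leading consonant makes the terms equivalent.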

Current Assignee: LiveRamp, Inc. (LiveRamp Holdings, Inc.)
Original Assignee: Acxiom Corporation (LiveRamp Holdings, Inc.)
Inventors: Yi, Gon; Miyahira, Aaron; Marupally, Pavan
Granted Patent: US 9,594,742 B2
US Class Current: 704/8
CPC Class Codes:
G06F 40/232 Orthographic correction, e....
G06F 40/284 Lexical analysis, e.g. toke...