System and method for utilizing multiple encodings to identify similar language characters

US 9,128,915 B2
Filed: 08/03/2012
Issued: 09/08/2015
Est. Priority Date: 08/03/2012
Status: Active Grant

Abstract

Described herein are systems and methods for identifying the similarity between language characters. As described herein, a pair of language characters is received at a language character match engine. The language character match engine is adapted to receive encoding configuration information from each of a plurality of encoding components, and is adapted to encode the pair of language characters based on the unique structure of each language character to generate a pair of string identification characters for each encoding component. Thereafter, each pair of string identification characters is compared to one another to generate a similarity score, and the similarity score for each pair of string identification characters is combined to create a composite similarity score. The composite similarity score represents a similarity between the pair of language characters, and is used to identify the similarity between the pair of language characters.
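
The flow described in the abstract (encode a pair of characters under each encoding component, compare the resulting strings, and combine the per-encoding scores) can be sketched as follows. This is a minimal illustration, not the patented implementation: the code tables are hypothetical placeholders rather than real encodings, `difflib.SequenceMatcher` stands in for whatever string comparison the engine actually uses, and the unweighted mean is only one possible way to combine scores.

```python
from difflib import SequenceMatcher
from statistics import mean

# Hypothetical placeholder tables: these strings are NOT real codes from any
# input-method encoding. Each table maps a character to its identification string.
ENCODING_TABLES = {
    "first": {"X": "abcd", "Y": "abce"},
    "second": {"X": "pq", "Y": "pq"},
    "third": {"X": "1234", "Y": "1299"},
}


def string_similarity(a: str, b: str) -> float:
    """Stand-in comparison: difflib's ratio in [0, 1] (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def composite_similarity(c1: str, c2: str) -> float:
    """Encode the pair under every table, compare, and average the scores."""
    scores = [
        string_similarity(table[c1], table[c2])
        for table in ENCODING_TABLES.values()
    ]
    return mean(scores)


if __name__ == "__main__":
    print(composite_similarity("X", "Y"))
```

A pair that looks alike under all three encodings scores near 1, while a pair that agrees under only one encoding is pulled toward 0 by the other two.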

37 Citations

20 Claims

1. A method for improving accuracy of data matching in a middleware machine environment by identifying a similarity between language characters of a character set of a language, wherein each language character has a unique structure, the method comprising:
- providing a language character match engine, wherein the language character match engine executes on one or more microprocessor, wherein the language character match engine comprises a plurality of encoding components, including at least a first encoding component and a second encoding component and a third encoding component;
  
  using the language character match engine to generate a composite similarity score set for the character set of the language wherein said similarity index comprises a composite similarity score for each of a plurality of pairs of language characters of the character set of the language;
  
  wherein the composite similarity score for each of the plurality of pairs of language characters is prepared by,
  - receiving the pair of language characters with the language character match engine,
    
    using the first encoding component to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a first-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the first-encoded strings of identification characters for each of the pair of language characters to one another to generate a first-encoding similarity score for the pair of language characters,
    
    using the second encoding component to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a second-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the second-encoded strings of identification characters for each of the pair of language characters to one another to generate a second-encoding similarity score for the pair of language characters,
    
    using the third encoding component to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a third-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the third-encoded strings of identification characters for each of the pair of language characters to one another to generate a third-encoding similarity score for the pair of language characters, and
    
    combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate a composite similarity score for the pair of language characters.
- - 2. The method of claim 1, wherein:
    - the first encoding component is a Wubi encoding component;
      
      the first encoded strings of characters are Wubi-encoded strings of characters; and
      
      the first-encoding similarity score is a Wubi-encoding similarity score.
  - 3. The method of claim 2, wherein:
    - the second encoding component is a Cangjie encoding component;
      
      the second encoded strings of characters are Cangjie-encoded strings of characters; and
      
      the second-encoding similarity score is a Cangjie-encoding similarity score.
  - 4. The method of claim 3, wherein:
    - the third encoding component is a Four-Corner encoding component;
      
      the third encoded strings of characters are Four-Corner-encoded strings of characters; and
      
      the third-encoding similarity score is a Four-Corner-encoding similarity score.
  - 5. The method of claim 1, further comprising:
    - associating a first, second, and third predefined weight respectively to each of said first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters when combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate the composite similarity score for the pair of language characters.
  - 6. The method of claim 5, further comprising combining the first, second, and third predefined weight respectively with each of said first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters when combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate the composite similarity score for the pair of language characters.
  - 7. The method of claim 1, further comprising comparing the composite similarity score for the pair of language characters to a scale to classify the pair of language characters as one of an exact match, a partial match, and a mismatch.
  - 8. The method of claim 1, further comprising determining an edit distance between each of the plurality of pairs of language characters based on the composite similarity score of each of the pairs of language characters of the language.
  - 9. The method of claim 1, wherein, comparing the first-encoded strings of identification characters for each of the pair of language characters to one another to generate a first-encoding similarity score for the pair of language characters, comprises:
    - comparing first-encoded strings of identification characters to one another, digit by digit;
      
      assigning a score to each digit compared;
      
      computing a raw score by adding together the score from each digit compared; and
      
      normalizing the raw score to compute the first encoding similarity score.
  - 10. The method of claim 1, further comprising:
    - using a fourth encoding component to encode each language character of the pair of language characters based on phonetic properties of said characters, and generate, for each language character, a fourth-encoded string of identification characters representing the phonetic properties of the language character;
      
      comparing the fourth-encoded strings of identification characters for each of the pair of language characters to one another to generate a fourth-encoding similarity score for the pair of language characters; and
      
      wherein said combining step comprises combining the first-encoding similarity score, the second-encoding similarity score, the third-encoding similarity score, and the fourth encoding similarity score for the pair of language characters to generate said composite similarity score for the pair of language characters.
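
The scoring steps recited in claims 5 through 7 and claim 9 can be illustrated with a short sketch. The claims do not state the per-digit score, the weight values, or the thresholds of the classification scale, so everything numeric here is an assumption: one point per matching position, normalization by the longer code length, weights normalized by their sum, and hypothetical cut-offs of 1.0 and 0.5. The function names are likewise invented for illustration.

```python
from typing import Sequence


def encoding_similarity(code_a: str, code_b: str) -> float:
    """Compare two encoded strings digit by digit and normalize to [0, 1]."""
    length = max(len(code_a), len(code_b))
    if length == 0:
        return 1.0
    # Pad the shorter code so every position of the longer one is compared.
    a = code_a.ljust(length, "\0")
    b = code_b.ljust(length, "\0")
    # Assumed per-digit score: 1 for a match, 0 otherwise.
    digit_scores = [1 if x == y else 0 for x, y in zip(a, b)]
    raw_score = sum(digit_scores)
    return raw_score / length


def weighted_composite(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-encoding scores using predefined weights."""
    if len(scores) != len(weights):
        raise ValueError("one weight is required per score")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def classify(score: float, exact: float = 1.0, partial: float = 0.5) -> str:
    """Map a composite score onto an assumed three-level scale."""
    if score >= exact:
        return "exact match"
    if score >= partial:
        return "partial match"
    return "mismatch"


if __name__ == "__main__":
    scores = [
        encoding_similarity("abcd", "abce"),  # placeholder codes, not real data
        encoding_similarity("pq", "pq"),
        encoding_similarity("1234", "1299"),
    ]
    composite = weighted_composite(scores, [0.5, 0.3, 0.2])
    print(composite, classify(composite))
```

Dividing by the sum of the weights keeps the composite in the same 0 to 1 range as the individual scores, so the same scale applies regardless of how the weights are chosen.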

11. A non-transitory computer readable storable medium storing instructions thereon for improving accuracy of data matching in a middleware machine environment by identifying a similarity between language characters of a language, wherein each language character has a unique structure, which instructions, when processed in a middleware machine of said middleware machine environment, cause the middleware machine to perform steps comprising:
- using the language character match engine to generate a composite similarity score set for the character set of the language wherein said similarity index comprises a composite similarity score for each of a plurality of pairs of language characters of the character set of the language, and wherein the composite similarity score for each of the plurality of pairs of language characters is prepared by,
  - receiving the pair of language characters with a character match engine,
    
    using a first encoding component of the character match engine to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a first-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the first-encoded strings of identification characters for each of the pair of language characters to one another to generate a first-encoding similarity score for the pair of language characters,
    
    using a second encoding component of the character match engine to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a second-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the second-encoded strings of identification characters for each of the pair of language characters to one another to generate a second-encoding similarity score for the pair of language characters,
    
    using a third encoding component of the character match engine to encode each language character of the pair of language characters based on the unique structure of each language character and generate, for each language character, a third-encoded string of identification characters representing the unique structure of the language character,
    
    comparing the third-encoded strings of identification characters for each of the pair of language characters to one another to generate a third-encoding similarity score for the pair of language characters, and
    
    combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate a composite similarity score for the pair of language characters.
- - 12. The non-transitory computer readable storable medium of claim 11, wherein:
    - the third encoding component is a Four-Corner encoding component;
      
      the third encoded strings of characters are Four-Corner-encoded strings of characters; and
      
      the third-encoding similarity score is a Four-Corner-encoding similarity score.
  - 13. The non-transitory computer readable storable medium of claim 12, wherein:
    - the second encoding component is a Cangjie encoding component;
      
      the second encoded strings of characters are Cangjie-encoded strings of characters; and
      
      the second-encoding similarity score is a Cangjie-encoding similarity score.
  - 14. The non-transitory computer readable storable medium of claim 13, wherein:
    - the first encoding component is a Wubi encoding component;
      
      the first encoded strings of characters are Wubi-encoded strings of characters; and
      
      the first-encoding similarity score is a Wubi-encoding similarity score.
  - 15. The non-transitory computer readable storable medium of claim 11, storing further instructions thereon, which instructions, when processed in a middleware machine of said middleware machine environment, cause the middleware machine to perform further steps comprising:
    - associating a first, second, and third predefined weight respectively to each of said first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters when combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate the composite similarity score for the pair of language characters.
  - 16. The non-transitory computer readable storable medium of claim 11, wherein, comparing the first-encoded strings of identification characters for each of the pair of language characters to one another to generate a first-encoding similarity score for the pair of language characters, comprises:
    - comparing first-encoded strings of identification characters to one another, digit by digit;
      
      assigning a score to each digit compared;
      
      computing a raw score by adding together the score from each digit compared; and
      
      normalizing the raw score to compute the first encoding similarity score.
  - 17. The non-transitory computer readable storable medium of claim 11, storing further instructions thereon, which instructions, when processed in a middleware machine of said middleware machine environment, cause the middleware machine to perform further steps comprising:
    - using a fourth encoding component of said character match engine to encode each language character of the pair of language characters based on phonetic properties of said characters, and generate, for each language character, a fourth-encoded string of identification characters representing the phonetic properties of the language character;
      
      comparing the fourth-encoded strings of identification characters for each of the pair of language characters to one another to generate a fourth-encoding similarity score for the pair of language characters; and
      
      wherein said combining step comprises combining the first-encoding similarity score, the second-encoding similarity score, the third-encoding similarity score, and the fourth encoding similarity score for the pair of language characters to generate said composite similarity score for the pair of language characters.
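
Claim 17 adds a fourth, phonetic encoding component alongside the three structural ones. One way to picture this is a list of interchangeable components, each carrying its own code table and weight, so that adding the phonetic component only extends the list. The sketch below assumes that design. The `EncodingComponent` class, the code tables, and the equal weights are hypothetical, and the placeholder strings are not real structural or phonetic codes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EncodingComponent:
    """A hypothetical encoding component: a code table plus a weight."""
    name: str
    table: Dict[str, str]  # character -> identification string (placeholder data)
    weight: float = 1.0

    def score(self, c1: str, c2: str) -> float:
        """Position-by-position match ratio between the two encoded strings."""
        a, b = self.table[c1], self.table[c2]
        length = max(len(a), len(b))
        if length == 0:
            return 1.0
        return sum(x == y for x, y in zip(a, b)) / length


def composite(c1: str, c2: str, components: List[EncodingComponent]) -> float:
    """Weighted average of the scores from every component in the list."""
    total_weight = sum(c.weight for c in components)
    return sum(c.weight * c.score(c1, c2) for c in components) / total_weight


# Three structure-based components with placeholder codes.
STRUCTURAL = [
    EncodingComponent("first", {"X": "abcd", "Y": "abce"}),
    EncodingComponent("second", {"X": "pq", "Y": "pq"}),
    EncodingComponent("third", {"X": "1234", "Y": "1299"}),
]
# A fourth component keyed on phonetic properties (placeholder codes).
PHONETIC = EncodingComponent("fourth", {"X": "ma", "Y": "ma"})

if __name__ == "__main__":
    print(composite("X", "Y", STRUCTURAL))
    print(composite("X", "Y", STRUCTURAL + [PHONETIC]))
```

In this toy data the two characters share a phonetic code, so including the fourth component raises the composite score relative to the structure-only result.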

18. A system for generating a similarity index identifying a similarity between language characters of a language, wherein each language character has a unique structure, the system comprising:
- a computer system comprising a microprocessor and a memory and a language character match engine, wherein said language character match engine comprises a plurality of encoding components for encoding a plurality of pairs of language characters of the language based on the unique structure of each language character;
  
  a first encoding component of the language character match engine which is configured to encode each language character of each of said plurality of pairs of language characters based on the unique structure of each language character, generate a first-encoded string of identification characters representing the unique structure of each language character, and compare the first-encoded strings of identification characters generated for each language character to one another to generate a first-encoding similarity score for each of the plurality of pairs of language characters;
  
  a second encoding component of the language character match engine which is configured to encode each language character of each of said plurality of pairs of language characters based on the unique structure of each language character, generate a second-encoded string of identification characters representing the unique structure of each language character, and compare the second-encoded strings of identification characters generated for each language character to one another to generate a second-encoding similarity score for each of the plurality of pairs of language characters;
  
  a third encoding component of the language character match engine which is configured to encode each language character of each of said plurality of pairs of language characters based on the unique structure of each language character, generate a third-encoded string of identification characters representing the unique structure of each language character, and compare the third-encoded strings of identification characters generated for each language character to one another to generate a third-encoding similarity score for each of the plurality of pairs of language characters;
  
  wherein said language character match engine is configured to create a composite similarity score set for the character set of the language by receiving each of said plurality of pairs of language characters, and combining the first-encoding similarity score, the second-encoding similarity score and the third-encoding similarity score for each of the plurality of pairs of language characters to compute a composite similarity score for each of the plurality of pairs of language characters.
- - 19. The system of claim 18, wherein:
    - the third encoding component is a Four-Corner encoding component;
      
      the second encoding component is a Cangjie encoding component; and
      
      the first encoding component is a Wubi encoding component.
  - 20. The system of claim 19, wherein the language character match engine is configured to associate a first, second, and third predefined weight respectively to each of said first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters when combining the first-encoding similarity score, the second-encoding similarity score, and the third-encoding similarity score for the pair of language characters to generate the composite similarity score for the pair of language characters.
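
The system of claim 18 produces a composite similarity score set covering many pairs of characters from a character set. A rough sketch of building such a set, and of one possible use for it, follows. The use shown here, treating `1 - score` as the substitution cost in a Levenshtein-style edit distance between two strings, is an assumed interpretation of the edit distance mentioned in claim 8 and is not something the claims spell out. The function names and the toy scoring function in the usage section are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable

ScoreSet = Dict[FrozenSet[str], float]


def build_score_set(chars: Iterable[str],
                    score_fn: Callable[[str, str], float]) -> ScoreSet:
    """Compute a composite score for every unordered pair of distinct characters."""
    unique = sorted(set(chars))
    return {frozenset(pair): score_fn(*pair) for pair in combinations(unique, 2)}


def lookup(score_set: ScoreSet, a: str, b: str) -> float:
    """Identical characters score 1.0; unknown pairs are treated as 0.0."""
    if a == b:
        return 1.0
    return score_set.get(frozenset((a, b)), 0.0)


def weighted_edit_distance(s: str, t: str, score_set: ScoreSet) -> float:
    """Levenshtein distance where substituting similar characters is cheap."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [float(i)]
        for j, b in enumerate(t, 1):
            substitute = prev[j - 1] + (1.0 - lookup(score_set, a, b))
            delete = prev[j] + 1.0
            insert = curr[j - 1] + 1.0
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Toy composite scorer: only the pair (X, Y) is considered similar.
    def toy_score(a: str, b: str) -> float:
        return 0.8 if {a, b} == {"X", "Y"} else 0.0

    scores = build_score_set("XYZ", toy_score)
    print(weighted_edit_distance("XZ", "YZ", scores))
```

Precomputing the set once means later matching only needs a dictionary lookup per character pair instead of re-encoding and re-comparing characters.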

Current Assignee: Oracle International Corporation (Oracle Corporation)
Original Assignee: Oracle International Corporation (Oracle Corporation)
Inventors: Qian, Jun; Ouaguenouni, Sofiane
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): THOMAS-HOMESCU, ANNE L
Application Number: US13/566,385
Publication Number: US 20140052436A1
Time in Patent Office: 1,131 Days
Field of Search: 704/4, 704/705, 704/776, 704/2, 704/7, 704/706, 704/9, 704/748, 704/18, 704/12, 704/14, 382/181
US Class Current: 1/1
CPC Class Codes: G06F 40/129 Handling non-Latin characte...
