Transliteration pair matching

US 9,176,936 B2
Filed: 09/28/2012
Issued: 11/03/2015
Est. Priority Date: 09/28/2012
Status: Active Grant

Abstract

Feature sequences are extracted, as individual letters separated by spaces, from a digital representation of a proper name in a first language to obtain a first orthographic feature sequence set; and from a digital representation of a proper name in a second language to obtain a second orthographic feature sequence set. The first and second orthographic feature sequence sets (a transliteration pair) are compared to determine a similarity score, based on a similarity model including a plurality of conditional probabilities of known orthographic feature sequences in the first language given known orthographic feature sequences in the second language and a plurality of conditional probabilities of known orthographic feature sequences in the second language given known orthographic feature sequences in the first language. Based on at least one threshold value, it is determined whether the transliteration pair belong to an identical actual proper name.
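
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the probability tables, the position-by-position letter alignment, the `FLOOR` back-off value, the geometric-mean combination of the two directions, and the default threshold of 0.5 are all hypothetical choices made for the example.

```python
import math

def extract_features(name: str) -> list[str]:
    """Split a name into individual letters; ' '.join(...) gives the space-separated form."""
    return [ch for ch in name.lower() if ch.isalpha()]

# Hypothetical toy model. Keys are (letter, conditioning letter).
P_FIRST_GIVEN_SECOND = {("j", "z"): 0.6, ("a", "a"): 0.9, ("n", "n"): 0.9}
P_SECOND_GIVEN_FIRST = {("z", "j"): 0.5, ("a", "a"): 0.9, ("n", "n"): 0.9}
FLOOR = 1e-3  # assumed back-off probability for letter pairs missing from a table

def directional_score(src: list[str], tgt: list[str], table: dict) -> float:
    """Average log-probability of src letters given tgt letters, paired by position."""
    n = max(len(src), len(tgt))
    if n == 0:
        return math.log(FLOOR)
    total = 0.0
    for i in range(n):
        s = src[i] if i < len(src) else ""  # "" marks a missing letter
        t = tgt[i] if i < len(tgt) else ""
        total += math.log(table.get((s, t), FLOOR))
    return total / n

def similarity(first: str, second: str) -> float:
    """Combine both conditional directions into one score in (0, 1]."""
    f, s = extract_features(first), extract_features(second)
    forward = directional_score(f, s, P_FIRST_GIVEN_SECOND)
    backward = directional_score(s, f, P_SECOND_GIVEN_FIRST)
    return math.exp((forward + backward) / 2)

def is_same_name(first: str, second: str, threshold: float = 0.5) -> bool:
    """Declare a match when the similarity score exceeds the threshold."""
    return similarity(first, second) > threshold
```

With these toy tables, `is_same_name("Jan", "Zan")` returns `True` and `is_same_name("Jan", "Bob")` returns `False`. Using both conditional directions penalizes a pair that looks plausible in only one direction.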

35 Citations

25 Claims

1. An orthographic method for transliteration pair matching, said method comprising:
- extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a first language to obtain a first orthographic feature sequence set;
  
  extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a second language to obtain a second orthographic feature sequence set, said digital representation of said proper name in said first language and said digital representation of said proper name in said second language comprising a transliteration pair;
  
  comparing said first and second orthographic feature sequence sets to determine a similarity score, based on a similarity model comprising a plurality of conditional probabilities of known orthographic feature sequences in said first language given known orthographic feature sequences in said second language and a plurality of conditional probabilities of known orthographic feature sequences in said second language given known orthographic feature sequences in said first language; and
  
  based on at least one threshold value, determining whether said transliteration pair belong to an identical actual proper name.
- - 2. The method of claim 1, further comprising building said similarity model using statistical machine translation phrase tables.
  - 3. The method of claim 2, wherein said first language is character-based, further comprising rendering said digital representation of said proper name in said first language into a Romanized form prior to extracting said feature sequences for said digital representation of said proper name in said first language as said individual letters separated by spaces.
  - 4. The method of claim 2, wherein, in said extracting steps, at least some of said feature sequences comprise multiple features.
  - 5. The method of claim 2, wherein said comparing comprises carrying out Viterbi alignment based on said similarity model.
  - 6. The method of claim 2, further comprising estimating said similarity model based on discriminative training.
  - 7. The method of claim 6, further comprising updating said similarity model using minimum classification error training.
  - 8. The method of claim 1, wherein said determining comprises indicating that said transliteration pair indeed belong to said identical actual proper name if said similarity score exceeds at least one threshold value.
  - 9. The method of claim 1, wherein said extracting, comparing, and determining steps are repeated for a plurality of additional transliteration pairs with an equal error rate of less than two percent.
  - 10. The method of claim 1, further comprising providing a system, wherein the system comprises distinct software modules, each of the distinct software modules being embodied on a non-transitory computer-readable storage medium, and wherein the distinct software modules comprise a first language feature extraction module, a second language feature extraction module, a decoder module, and a comparator module;
    - wherein:
      
      said extracting of said feature sequences from said digital representation of said proper name in said first language is carried out by said first language feature extraction module executing on at least one hardware processor;
      
      said extracting of said feature sequences from said digital representation of said proper name in said second language is carried out by said second language feature extraction module executing on at least one hardware processor;
      
      said comparing of said first and second orthographic feature sequence sets is carried out by said decoder module executing on said at least one hardware processor; and
      
      said determining whether said transliteration pair belong to an identical actual proper name is carried out by said comparator module executing on said at least one hardware processor.
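
Claim 5 recites Viterbi alignment based on the similarity model. One way to picture that step is a Viterbi-style dynamic program that finds the highest-probability monotone alignment between two letter sequences. The sketch below is an assumption-laden illustration only: it aligns single letters rather than the multi-letter sequences a phrase table would supply, and `pair_prob`, `gap_prob`, and `floor` are hypothetical parameters.

```python
import math

def viterbi_align(src, tgt, pair_prob, gap_prob=0.01, floor=1e-4):
    """Return (best log-probability, alignment path) for two letter sequences.

    Each path element is (src_letter, tgt_letter); None marks a letter left unaligned.
    """
    n, m = len(src), len(tgt)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # align src[i-1] with tgt[j-1]
                p = pair_prob.get((src[i - 1], tgt[j - 1]), floor)
                cand = best[i - 1][j - 1] + math.log(p)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, (i - 1, j - 1)
            if i:  # leave src[i-1] unaligned
                cand = best[i - 1][j] + math.log(gap_prob)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, (i - 1, j)
            if j:  # leave tgt[j-1] unaligned
                cand = best[i][j - 1] + math.log(gap_prob)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, (i, j - 1)
    # Follow back-pointers from the end to recover the best path.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((src[pi] if pi < i else None, tgt[pj] if pj < j else None))
        i, j = pi, pj
    path.reverse()
    return best[n][m], path

# Toy usage with a hypothetical probability table.
toy_probs = {("j", "z"): 0.6, ("a", "a"): 0.9, ("n", "n"): 0.9}
score, path = viterbi_align(list("jan"), list("zhan"), toy_probs)
```

Here the best path pairs `j` with `z`, leaves `h` unaligned, and pairs the remaining letters one to one.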

11. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform a method for transliteration pair matching, the method comprising the steps of:
- extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a first language to obtain a first orthographic feature sequence set;
  
  extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a second language to obtain a second orthographic feature sequence set, said digital representation of said proper name in said first language and said digital representation of said proper name in said second language comprising a transliteration pair;
  
  comparing said first and second orthographic feature sequence sets to determine a similarity score, based on a similarity model comprising a plurality of conditional probabilities of known orthographic feature sequences in said first language given known orthographic feature sequences in said second language and a plurality of conditional probabilities of known orthographic feature sequences in said second language given known orthographic feature sequences in said first language; and
  
  based on at least one threshold value, determining whether said transliteration pair belong to an identical actual proper name.
- - 12. The non-transitory computer readable medium of claim 11, wherein the method further comprises building said similarity model using statistical machine translation phrase tables.
  - 13. The non-transitory computer readable medium of claim 12, wherein said first language is character-based, wherein the method further comprises rendering said digital representation of said proper name in said first language into a Romanized form prior to extracting said feature sequences for said digital representation of said proper name in said first language as said individual letters separated by spaces.
  - 14. The non-transitory computer readable medium of claim 12, wherein, in said steps of extracting feature sequences, at least some of said feature sequences comprise multiple features.
  - 15. The non-transitory computer readable medium of claim 12, wherein said comparing comprises carrying out Viterbi alignment based on said similarity model.
  - 16. The non-transitory computer readable medium of claim 12, wherein the method further comprises estimating said similarity model based on discriminative training.
  - 17. The non-transitory computer readable medium of claim 16, wherein the method further comprises updating said similarity model using minimum classification error training.
  - 18. The non-transitory computer readable medium of claim 11, wherein said determining comprises indicating that said transliteration pair indeed belong to said identical actual proper name if said similarity score exceeds at least one threshold value.
  - 19. The non-transitory computer readable medium of claim 11, wherein the method further comprises repeating said extracting, comparing, and determining for a plurality of additional transliteration pairs with an equal error rate of less than two percent.
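
Claims 9 and 19 refer to an equal error rate, the operating point at which the false-accept rate equals the false-reject rate. As a rough sketch of how that figure could be measured on a labelled set of pairs, the function below sweeps candidate thresholds and reports the point where the two rates are closest. The function name, the sample scores, and the labels are made up for illustration.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) where false-accept and false-reject rates are closest.

    scores: similarity score per pair; labels: 1 for a true match, 0 otherwise.
    A pair is accepted when its score exceeds the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one matching and one non-matching pair")
    best_gap, best_eer, best_thr = None, None, None
    for thr in [float("-inf")] + sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if s > thr and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s <= thr and y == 1)
        far = false_accepts / negatives
        frr = false_rejects / positives
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr

# Made-up evaluation data: six scored pairs, three of them true matches.
eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0, 0])
```

On this toy data the rates meet at a threshold of 0.4, where one of three non-matches is accepted and one of three matches is rejected.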

20. An apparatus for transliteration pair matching comprising:
- a memory; and
  
  at least one processor, coupled to said memory, and operative to:
  
  extract feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a first language to obtain a first orthographic feature sequence set;
  
  extract feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a second language to obtain a second orthographic feature sequence set, said digital representation of said proper name in said first language and said digital representation of said proper name in said second language comprising a transliteration pair;
  
  compare said first and second orthographic feature sequence sets to determine a similarity score, based on a similarity model comprising a plurality of conditional probabilities of known orthographic feature sequences in said first language given known orthographic feature sequences in said second language and a plurality of conditional probabilities of known orthographic feature sequences in said second language given known orthographic feature sequences in said first language; and
  
  based on at least one threshold value, determine whether said transliteration pair belong to an identical actual proper name.
- - 21. The apparatus of claim 20, wherein said at least one processor is further operative to build said similarity model using statistical machine translation phrase tables.
  - 22. The apparatus of claim 21, wherein said first language is character-based, and wherein said at least one processor is further operative to render said digital representation of said proper name in said first language into a Romanized form prior to extracting said feature sequences for said digital representation of said proper name in said first language as said individual letters separated by spaces.
  - 23. The apparatus of claim 21, wherein at least some of said feature sequences comprise multiple features.
  - 24. The apparatus of claim 20, further comprising a plurality of distinct software modules, each of the distinct software modules being embodied on a non-transitory computer-readable storage medium, and wherein the distinct software modules comprise a first language feature extraction module, a second language feature extraction module, a decoder module, and a comparator module;
    - wherein:
      
      said at least one processor is operative to extract said feature sequences from said digital representation of said proper name in said first language by executing said first language feature extraction module;
      
      said at least one processor is operative to extract said feature sequences from said digital representation of said proper name in said second language by executing said second language feature extraction module;
      
      said at least one processor is operative to compare said first and second orthographic feature sequence sets by executing said decoder module; and
      
      said at least one processor is operative to determine whether said transliteration pair belong to an identical actual proper name by executing said comparator module.
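
Claims 10 and 24 divide the work among four software modules: two feature extraction modules, a decoder, and a comparator. The sketch below shows one possible way to wire such modules together, with an optional Romanization hook in the spirit of claims 3 and 22. All class names, the one-entry `TOY_ROMANIZATION` table, and the letter-overlap scorer are hypothetical stand-ins; in particular, `overlap_score` is a placeholder and not the probabilistic similarity model the claims describe.

```python
class FeatureExtractionModule:
    """Turns a name into individual letters, optionally Romanizing it first."""
    def __init__(self, romanize=None):
        self.romanize = romanize or (lambda text: text)

    def run(self, name):
        return [ch for ch in self.romanize(name).lower() if ch.isalpha()]

class DecoderModule:
    """Compares two feature sequences and returns a similarity score."""
    def __init__(self, score_fn):
        self.score_fn = score_fn

    def run(self, first, second):
        return self.score_fn(first, second)

class ComparatorModule:
    """Applies the threshold to the similarity score."""
    def __init__(self, threshold):
        self.threshold = threshold

    def run(self, score):
        return score > self.threshold

class TransliterationMatcher:
    """Chains the four modules in the order the claims list them."""
    def __init__(self, first_extractor, second_extractor, decoder, comparator):
        self.first_extractor = first_extractor
        self.second_extractor = second_extractor
        self.decoder = decoder
        self.comparator = comparator

    def match(self, first_name, second_name):
        first = self.first_extractor.run(first_name)
        second = self.second_extractor.run(second_name)
        return self.comparator.run(self.decoder.run(first, second))

def overlap_score(first, second):
    """Placeholder scorer: Jaccard overlap of the two letter sets."""
    a, b = set(first), set(second)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical one-entry Romanization table, for illustration only.
TOY_ROMANIZATION = {"张": "zhang"}

def toy_romanize(text):
    return "".join(TOY_ROMANIZATION.get(ch, ch) for ch in text)

matcher = TransliterationMatcher(
    FeatureExtractionModule(romanize=toy_romanize),
    FeatureExtractionModule(),
    DecoderModule(overlap_score),
    ComparatorModule(threshold=0.5),
)
```

With this wiring, `matcher.match("张", "Zhang")` returns `True` and `matcher.match("张", "Smith")` returns `False`. Keeping the decoder behind its own interface means the placeholder scorer can be swapped for a trained model without touching the other modules.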

25. An apparatus for transliteration pair matching comprising:
- means for extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a first language to obtain a first orthographic feature sequence set;
  
  means for extracting feature sequences, as individual letters separated by spaces, from a digital representation of a proper name in a second language to obtain a second orthographic feature sequence set, said digital representation of said proper name in said first language and said digital representation of said proper name in said second language comprising a transliteration pair;
  
  means for comparing said first and second orthographic feature sequence sets to determine a similarity score, based on a similarity model comprising a plurality of conditional probabilities of known orthographic feature sequences in said first language given known orthographic feature sequences in said second language and a plurality of conditional probabilities of known orthographic feature sequences in said second language given known orthographic feature sequences in said first language; and
  
  means for, based on at least one threshold value, determining whether said transliteration pair belong to an identical actual proper name.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Jan, Ea-Ee; Ge, Niyu
Primary Examiner(s): SPOONER, LAMONT M
Application Number: US13/630,479
Publication Number: US 20140095143A1
Time in Patent Office: 1,131 Days
Field of Search: 704/2-8
US Class Current: 1/1
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/163   Handling of whitespace
G06F 40/232   Orthographic correction, e....