Method for disambiguating multiple readings in language conversion

US 8,706,472 B2
Filed: 08/11/2011
Issued: 04/22/2014
Est. Priority Date: 08/11/2011
Status: Active Grant


Abstract

Disambiguating multiple readings in language conversion is disclosed, including: receiving an input data to be converted into a set of characters comprising a symbolic representation of the input data in a target symbolic system; and using a language model that distinguishes between a first reading and a second reading of a character of the target symbolic system to determine a probability that the heteronymous character should be used to represent a corresponding portion of the input data.
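
The idea in the abstract can be illustrated with a minimal sketch, assuming a bigram model over tokens that pair a character with one specific reading. The counts, the `char/reading` token format, and the function names below are illustrative assumptions, not details taken from the patent. The character 长 is used as the heteronym: it reads "chang" (long) or "zhang" (to grow), and each reading keeps its own statistics.

```python
import math
from collections import Counter

# Toy, made-up bigram counts over reading-annotated tokens "char/reading".
# 长 is heteronymous: "chang" (long) vs. "zhang" (to grow). Each reading is
# a separate token, so its statistics are tracked individually.
BIGRAM_COUNTS = Counter({
    ("<s>", "长/zhang"): 4,
    ("长/zhang", "大/da"): 9,
    ("<s>", "长/chang"): 6,
    ("长/chang", "大/da"): 1,
    ("<s>", "张/zhang"): 5,
    ("张/zhang", "大/da"): 2,
})

# How often each token appears as the left-hand context of a bigram.
CONTEXT_COUNTS = Counter()
for (prev, _), n in BIGRAM_COUNTS.items():
    CONTEXT_COUNTS[prev] += n

VOCAB = {tok for pair in BIGRAM_COUNTS for tok in pair}


def log_prob(tokens):
    """Add-one smoothed bigram log-probability of a reading-annotated string."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        numerator = BIGRAM_COUNTS[(prev, tok)] + 1
        denominator = CONTEXT_COUNTS[prev] + len(VOCAB)
        total += math.log(numerator / denominator)
        prev = tok
    return total


def best_candidate(candidates):
    """Pick the candidate string with the highest model score."""
    return max(candidates, key=log_prob)


# Candidates whose readings match the pinyin input "zhang da".
candidates = [["长/zhang", "大/da"], ["张/zhang", "大/da"]]
print(best_candidate(candidates))
```

Because 长/chang and 长/zhang are distinct tokens, the frequent "chang" usages of 长 do not inflate the score of 长 when the input calls for "zhang".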

971 Citations

27 Claims

1. A method, comprising:
- at a device having one or more processors and memory;
  
  receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system;
  
  identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context;
  
  generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and
  
  converting the input data to a selected one of the plurality of candidate character strings, said converting comprising;
  
  determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
  - 2. The method of claim 1, wherein the input text comprises pinyin.
  - 3. The method of claim 1, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
  - 4. The method of claim 1, wherein the target symbolic system includes Chinese characters.
  - 5. The method of claim 1, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation of the first candidate character and the second pronunciation of the first candidate character.
  - 6. The method of claim 5, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
  - 7. The method of claim 1, further comprising:
    - receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and
      
      automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
  - 8. The method of claim 1, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second pronunciation of the first candidate character.
  - 9. The method of claim 1, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the character and a probability corresponding to a second sequence of characters including the second pronunciation of the character, wherein the first and second sequences each includes two or more characters.
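
Claims 5 and 6 describe a corpus annotated so that each pronunciation of a heteronymous character gets its own symbol. The following is a minimal sketch of that step, assuming Unicode Private Use Area code points as the new symbols and a simple bigram count as the training statistic. Both choices, and all names in the code, are assumptions made for illustration rather than details from the claims.

```python
from collections import Counter

# Hypothetical mapping: (character, reading) -> dedicated symbol.
# Private Use Area code points are an illustrative choice only.
READING_SYMBOLS = {
    ("长", "chang"): "\uE000",
    ("长", "zhang"): "\uE001",
}


def encode(annotated_sentence):
    """Replace each annotated heteronym with its reading-specific symbol.

    annotated_sentence is a list of (character, reading_or_None) pairs;
    characters without an annotation pass through unchanged.
    """
    return "".join(
        READING_SYMBOLS.get((char, reading), char)
        for char, reading in annotated_sentence
    )


def bigram_counts(sentences):
    """Count adjacent symbol pairs, with "<s>" marking the sentence start."""
    counts = Counter()
    for sentence in sentences:
        symbols = ["<s>"] + list(sentence)
        counts.update(zip(symbols, symbols[1:]))
    return counts


# Toy annotated corpus: 长大 (zhang, "grow up") and 很长 (chang, "very long").
corpus = [
    [("长", "zhang"), ("大", None)],
    [("很", None), ("长", "chang")],
]
encoded = [encode(sentence) for sentence in corpus]
counts = bigram_counts(encoded)
```

After encoding, the two readings of 长 never share a count, so any n-gram model trained on the encoded text keeps separate probabilities for them, in the manner of claims 8 and 9.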

10. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system;
  
  identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context;
  
  generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and
  
  converting the input data to a selected one of the plurality of candidate character strings, said converting comprising;
  
  determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
  - 11. The computer-readable medium of claim 10, wherein the input text comprises pinyin.
  - 12. The computer-readable medium of claim 10, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
  - 13. The computer-readable medium of claim 10, wherein the target symbolic system includes Chinese characters.
  - 14. The computer-readable medium of claim 10, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation and the second pronunciation of the first candidate character.
  - 15. The computer-readable medium of claim 14, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
  - 16. The computer-readable medium of claim 10, wherein the operations further comprise:
    - receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and
      
      automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
  - 17. The computer-readable medium of claim 10, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second pronunciation of the first candidate character.
  - 18. The computer-readable medium of claim 10, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the first candidate character and a probability corresponding to a second sequence of characters including the second pronunciation of the first candidate character, wherein the first and second sequences each includes two or more characters.
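
Claim 12 describes resolving the input text into monosyllabic groups before conversion. One possible way to do this, sketched here under the assumption of a tiny hard-coded syllable inventory, is to enumerate every split of the input into known syllables. The inventory and the function name are hypothetical, and a real pinyin inventory is far larger.

```python
from functools import lru_cache

# Toy syllable inventory; assumed for illustration only.
SYLLABLES = frozenset({"xi", "an", "xian", "zhang", "chang", "da"})
MAX_LEN = max(map(len, SYLLABLES))


def segment(text):
    """Return every way to split text into syllables from SYLLABLES."""

    @lru_cache(maxsize=None)
    def helper(start):
        # Reaching the end yields one (empty) completion.
        if start == len(text):
            return ((),)
        results = []
        last = min(len(text), start + MAX_LEN)
        for end in range(start + 1, last + 1):
            piece = text[start:end]
            if piece in SYLLABLES:
                for rest in helper(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(seg) for seg in helper(0)]


print(segment("zhangda"))
print(segment("xian"))
```

An input such as "xian" is ambiguous between one syllable and two, so the sketch returns both segmentations and leaves the choice to a later scoring step.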

19. A system, comprising:
- one or more processors; and
  
  memory having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising;
  
  receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system;
  
  identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context;
  
  generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and
  
  converting the input data to a selected one of the plurality of candidate character strings, said converting comprising;
  
  determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
  - 20. The system of claim 19, wherein the input text comprises pinyin.
  - 21. The system of claim 19, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
  - 22. The system of claim 19, wherein the target symbolic system includes Chinese characters.
  - 23. The system of claim 19, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation and the second pronunciation of the first candidate character.
  - 24. The system of claim 23, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
  - 25. The system of claim 19, wherein the operations further comprise:
    - receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and
      
      automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
  - 26. The system of claim 19, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second pronunciation of the first candidate character.
  - 27. The system of claim 19, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the first candidate character and a probability corresponding to a second sequence of characters including the second pronunciation of the first candidate character, wherein the first and second sequences each includes two or more characters.
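
Claim 25 describes manual annotation of a subset of the corpus followed by automatic annotation of the rest. The claim does not say how the automatic step works. The sketch below assumes one simple heuristic: for each heteronym, vote on the reading most often seen with the same following character in the manual annotations. The data format and function names are hypothetical.

```python
from collections import Counter, defaultdict

END = "</s>"  # Marker used when the heteronym is the last character.


def next_char(text, i):
    """Return the character after position i, or END at the end of text."""
    return text[i + 1] if i + 1 < len(text) else END


def learn_rules(manual):
    """Build (character, next character) -> reading rules by majority vote.

    manual is an iterable of (text, index, reading) triples, where
    text[index] is a heteronym that a person has annotated.
    """
    votes = defaultdict(Counter)
    for text, i, reading in manual:
        votes[(text[i], next_char(text, i))][reading] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in votes.items()}


def auto_annotate(text, rules):
    """Return (index, reading) for each position covered by a learned rule."""
    found = []
    for i, ch in enumerate(text):
        reading = rules.get((ch, next_char(text, i)))
        if reading is not None:
            found.append((i, reading))
    return found


# Toy manual annotations for the heteronym 长.
manual = [
    ("他长大了", 1, "zhang"),
    ("长大以后", 0, "zhang"),
    ("这条路很长", 4, "chang"),
]
rules = learn_rules(manual)
print(auto_annotate("孩子长大了", rules))
print(auto_annotate("头发很长", rules))
```

Instances whose context was never seen in the manual set are left unannotated, which keeps the automatic step conservative.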

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Ramerth, Brent D.; Naik, Devang K.; Davidson, Douglas R.; Dolfing, Jannes G. A.; Pu, Jia
Primary Examiner(s): Godbold, Douglas
Assistant Examiner(s): Estes, Ernest
Application Number: US13/208,222
Publication Number: US 20130041647A1
Time in Patent Office: 985 Days
Field of Search: 704/1-10, 704/251, 704/255, 704/257
US Class Current: 704/2
CPC Class Codes: G06F 40/53 Processing of non-Latin tex...
