Scalable neural network-based language identification from written text

US 20040078191A1
Filed: 10/22/2002
Published: 04/22/2004
Est. Priority Date: 10/22/2002
Status: Abandoned Application

Abstract

A method for language identification from written text, wherein a neural network-based language identification (NN-LID) system is used to identify the language of a string of alphabet characters among a plurality of languages. A standard set of alphabet characters is used for mapping the string into a mapped string of alphabet characters so as to allow the NN-LID system to determine the likelihood of the mapped string being each one of the languages based on the standard set. The characters of the standard set are selected from the alphabet characters of the language-dependent sets. A scoring system is also used to determine the likelihood of the string being each one of the languages based on the language-dependent sets.
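
The mapping step described in the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the choice of lowercase a-z as the standard set, the `_` null symbol, and the function name `map_to_standard` are all assumptions made for illustration. Each input character is reduced to its base letter where Unicode defines a decomposition, and anything outside the standard set becomes the null symbol, so the mapped string keeps the length of the input.

```python
import string
import unicodedata

# Assumed standard set: lowercase a-z (hypothetical choice for this sketch).
STANDARD_SET = set(string.ascii_lowercase)
# Hypothetical null symbol for characters with no standard counterpart.
NULL_SYMBOL = "_"


def map_to_standard(text: str) -> str:
    """Map each character to one standard character or the null symbol."""
    mapped = []
    for ch in text:
        # NFD splits e.g. "ä" into "a" plus a combining mark; keep the base letter.
        base = unicodedata.normalize("NFD", ch)[0].lower()
        mapped.append(base if base in STANDARD_SET else NULL_SYMBOL)
    return "".join(mapped)


if __name__ == "__main__":
    for word in ["häkkinen", "Łódź", "smörgåsbord"]:
        print(word, "->", map_to_standard(word))
```

Because the mapping is one character in, one character out, a downstream classifier sees a fixed, small input alphabet regardless of how many languages are supported.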

110 Citations

25 Claims

1. A method of identifying a language of a string of alphabet characters among a plurality of languages based on an automatic language identification system, each said plurality of languages having an individual set of alphabet characters, said method characterized by mapping the string of alphabet characters into a mapped string of alphabet characters selected from a reference set of alphabet characters, obtaining a first value indicative of a probability of the mapped string of alphabet characters being each one of said plurality of languages, obtaining a second value indicative of a match of the alphabet characters in the string in each individual set, and deciding the language of the string based on the first value and the second value.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
  - 2. The method of claim 1, further characterized in that the number of alphabet characters in the reference set is smaller than the union set of said all individual sets of alphabet characters.
  - 3. The method of claim 1, characterized in that the first value is obtained based on the reference set.
  - 4. The method of claim 3, characterized in that the reference set comprises a minimum set of standard alphabet characters such that every alphabet character in the individual set for each of said plurality of languages is uniquely mappable to one of the standard alphabet characters.
  - 5. The method of claim 3, characterized in that the reference set consists of a minimum set of standard alphabet characters and a null symbol, such that every alphabet character in the individual set for each of said plurality of languages is uniquely mappable to one of said standard alphabet characters.
  - 6. The method of claim 5, characterized in that the number of alphabet characters in the mapped string is equal to the number of the alphabet characters in the string.
  - 7. The method of claim 4, characterized in that the reference set comprises the minimum set of standard alphabet characters and at least one symbol different from the standard alphabet characters, so that each alphabet characters in at least one individual set is uniquely mappable to a combination of one of said standard alphabet characters and said at least one symbol.
  - 8. The method of claim 4, characterized in that the reference set comprises the minimum set of standard alphabet characters and a plurality of symbols different from the standard alphabet characters, so that each alphabet characters in at least one individual set is uniquely mappable to a combination of said standard alphabet characters and said at least one of said plurality of symbols.
  - 9. The method of claim 8, characterized in that the number of symbols is adjustable according to a desired performance of the automatic language identification system.
  - 10. The method of claim 1, characterized in that the automatic language identification system is a neural-network based system comprising a plurality of hidden units, and that the number of the hidden units is adjustable according to a desired performance of the automatic language identification system.
  - 11. The method of claim 3, characterized in that the automatic language identification system is a neural-network based system and the probability is computed by the neural-network based system.
  - 12. The method of claim 1, characterized in that the second value is obtained from a scaling factor assigned to a probability of the string given one of said plurality of languages.
  - 13. The method of claim 12, characterized in that the language is decided based on the maximum of the product of the first value and the second value among said plurality of languages.
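
The decision rule of claims 12 and 13 (pick the language that maximizes the product of the first value and the second value) can be sketched as follows. This is a hedged illustration, not the claimed system: the first values are passed in as a plain dictionary standing in for the neural network output, the second value is assumed to be the fraction of letters found in each language's alphabet set, and the alphabets, probabilities, and function names are hypothetical.

```python
import string

# Hypothetical language-dependent alphabet sets.
ALPHABETS = {
    "english": set(string.ascii_lowercase),
    "finnish": set(string.ascii_lowercase) | set("åäö"),
}


def alphabet_match(text, alphabet):
    """Assumed second value: share of letters that belong to the alphabet."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 1.0  # nothing to contradict the language
    return sum(ch in alphabet for ch in letters) / len(letters)


def decide(text, first_values, alphabets):
    """Return the language with the largest first_value * second_value."""
    scores = {
        lang: first_values[lang] * alphabet_match(text, alphabets[lang])
        for lang in alphabets
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Stand-in for classifier output on the mapped string (made-up numbers).
    first_values = {"english": 0.55, "finnish": 0.45}
    language, scores = decide("yötä", first_values, ALPHABETS)
    print(language, scores)
```

In this toy case the first value alone would favor English, but the alphabet score penalizes English because two of the four letters fall outside its set, so the combined decision is Finnish.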

14. A method of identifying a language of a string of alphabet characters among a plurality of languages based on an automatic language identification system, said plurality of languages classified into a plurality of language groups, each group having an individual set of alphabet characters, said method characterized by mapping the string of alphabet characters into a mapped string of alphabet characters selected from a reference set of alphabet characters, by obtaining a first value indicative of a probability of the mapped string of alphabet characters being each one of said plurality of languages, obtaining a second value indicative of a match of the alphabet characters in the string in each individual set, and deciding the language of the string based on the first value and the second value.
- Dependent Claims (15, 16)
  - 15. The method of claim 14, further characterized in that the number of alphabet characters in the reference set is smaller than the union set of said all individual sets of alphabet characters.
  - 16. The method of claim 14, characterized in that the first value is obtained based on the reference set.
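
Claim 14 differs from claim 1 in that the individual alphabet sets belong to language groups rather than to single languages. One way to picture this, under assumptions, is to compute the alphabet match once per group and share it among the group's member languages. The group names, member languages, alphabets, and the fraction-based score below are hypothetical and chosen only to make the sketch runnable.

```python
import string

# Hypothetical language groups, each with one shared alphabet set.
GROUPS = {
    "latin_basic": {
        "alphabet": set(string.ascii_lowercase),
        "languages": ["english", "dutch"],
    },
    "nordic": {
        "alphabet": set(string.ascii_lowercase) | set("åäö"),
        "languages": ["finnish", "swedish"],
    },
}


def per_language_match(text, groups):
    """Compute one alphabet score per group and assign it to every member."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    result = {}
    for group in groups.values():
        if letters:
            score = sum(ch in group["alphabet"] for ch in letters) / len(letters)
        else:
            score = 1.0
        for lang in group["languages"]:
            result[lang] = score
    return result


if __name__ == "__main__":
    print(per_language_match("yötä", GROUPS))
```

Grouping means the number of stored alphabet sets grows with the number of groups, not with the number of languages.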

17. A language identification system for identifying a language of a string of alphabet characters among a plurality of languages, each of said plurality of languages having an individual set of alphabet characters, said system characterized by:
- a reference set of alphabet characters,
  
  a mapping module for mapping the string of alphabet characters into a mapped string of alphabet characters selected from the reference set for providing a signal indicative of the mapped string,
  
  a first language discrimination module, responsive to the signal, for determining the likelihood of the mapped string being each one of said plurality of languages based on the reference set for providing first information indicative of the likelihood,
  
  a second language discrimination module, for determining the likelihood of the string being each one of said plurality of languages based on the individual sets of alphabet characters for providing second information indicative of the likelihood, and
  
  a decision module, responsive to the first information and second information, for determining the combined likelihood of the string being one of said plurality of languages based on the first information and second information.
- Dependent Claims (18, 19, 20)
  - 18. The system of claim 17, further characterized in that the number of alphabet characters in the reference set is smaller than the union set of said all individual sets of alphabet characters.
  - 19. The language identification system of claim 17, characterized in that the first language discrimination module is a neural-network based system comprising a plurality of hidden units, and the language identification system comprises a memory unit for storing the reference set in multiplicity based partially on said plurality of hidden units, and that the number of hidden units can be scaled according to the size of the memory unit.
  - 20. The language identification system of claim 17, characterized in that the first language discrimination module is a neural-network based system comprising a plurality of hidden units, and that the number of hidden units can be increased in order to improve the performance of the language identification system.
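
Claims 19 and 20 tie the number of hidden units to the available memory and to the desired performance. A rough way to see the trade-off is to count the weights of a one-hidden-layer network and solve for the largest hidden layer that fits a memory budget. The architecture assumed here (a window of characters, each one-hot coded over the reference set, one bias per unit, one output per language) and all parameter names are assumptions for illustration, not details confirmed by the claims.

```python
def parameter_count(hidden_units, window, alphabet_size, num_languages):
    """Weights and biases of an assumed one-hidden-layer network."""
    inputs = window * alphabet_size  # one-hot code per character in the window
    hidden_layer = (inputs + 1) * hidden_units
    output_layer = (hidden_units + 1) * num_languages
    return hidden_layer + output_layer


def max_hidden_units(memory_bytes, window, alphabet_size, num_languages,
                     bytes_per_weight=1):
    """Largest hidden layer whose parameters fit in the memory budget."""
    weights = memory_bytes // bytes_per_weight
    # Each hidden unit costs (inputs + 1) incoming and num_languages outgoing weights.
    per_hidden = window * alphabet_size + 1 + num_languages
    return max(0, (weights - num_languages) // per_hidden)


if __name__ == "__main__":
    hidden = max_hidden_units(10_000, window=5, alphabet_size=27, num_languages=4)
    print(hidden, parameter_count(hidden, 5, 27, 4))
```

Under this model the input layer dominates the cost, which is why a smaller reference set leaves room for more hidden units in the same memory.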

21. An electronic device, comprising:
- a module for providing a signal indicative of a string of alphabet characters;
  
  a language identification system, responsive to the signal, for identifying a language of the string among a plurality of languages, each of said plurality of languages having an individual set of alphabet characters, the system characterized by a reference set of alphabet characters;
  
  a mapping module for mapping the string of alphabet characters into a mapped string of alphabet characters selected from the reference set for providing a further signal indicative of the mapped string;
  
  a first language discrimination module, responsive to the further signal, for determining the likelihood of the mapped string being each one of said plurality of languages based on the reference set for providing first information indicative of the likelihood;
  
  a second language discrimination module, responsive to the first signal, for determining the likelihood of the string being each one of said plurality of languages based on the individual sets of alphabet characters for providing second information indicative of the likelihood;
  
  a decision module, responding to the first information and second information, for determining the combined likelihood of the string being one of said plurality of languages based on the first information and second information.
- Dependent Claims (22, 24, 25)
  - 22. The device of claim 21, wherein the number of alphabet characters in the reference set is smaller than the union set of said all individual sets of alphabet characters.
  - 24. The electronic device of claim 21, comprising a hand-held device.
  - 25. The electronic device of claim 21, comprising a mobile phone.


Current Assignee: Nokia Corporation
Original Assignee: Nokia Corporation
Inventors: Tian, Jilei; Suontausta, Janne
Application Number: US10/279,747
Publication Number: US 20040078191A1
US Class Current: 704/9
CPC Class Codes: G06F 40/263 Language identification
