Language identification for documents containing multiple languages
Abstract
Multiple non-overlapping languages within a single document can be identified. In one embodiment, a set of non-overlapping languages is defined for each language in a set of candidate languages. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. The language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment the document, and each segment is scored separately for a subset of the candidate languages. The language(s) of the document are then identified based on the segment scores.
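The second embodiment described in the abstract, segmenting a document at transitions between character sets, can be illustrated with a minimal Python sketch. This is not the patented implementation. The helper names `script_of` and `segment_by_script` are hypothetical, and the script of a character is approximated from the first word of its Unicode name, which is an assumption made here because the Python standard library has no script property.

```python
import unicodedata
from typing import List, Optional, Tuple


def script_of(ch: str) -> Optional[str]:
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: labels such as "LATIN", "CYRILLIC", "GREEK" or "CJK" are
    read off the character name; this is a rough stand-in for a real
    script lookup.
    """
    if not ch.isalpha():
        return None  # digits, whitespace and punctuation are treated as neutral
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def segment_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text wherever the script changes; returns (script, segment) pairs."""
    segments: List[Tuple[str, str]] = []
    current: Optional[str] = None
    buf: List[str] = []
    for ch in text:
        script = script_of(ch)
        if script is not None and current is not None and script != current:
            # Transition between character sets: close the running segment.
            segments.append((current, "".join(buf).strip()))
            buf = []
        if script is not None:
            current = script
        buf.append(ch)  # neutral characters stay with the current segment
    if current is not None:
        segments.append((current, "".join(buf).strip()))
    return segments


for script, segment in segment_by_script("Hello world. Привет мир!"):
    print(script, repr(segment))
```

Under this sketch, each segment would then be scored only against the candidate languages written in its script, for example Russian or Bulgarian for a Cyrillic segment. The name-prefix heuristic is coarse: it treats hiragana, katakana and CJK ideographs as three separate scripts, so Japanese text would be over-segmented.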
30 Claims
1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:
for each language (M) in the set of candidate languages, defining a set of non overlapping languages (N(M)), the set N(M) consisting of one or more languages (L), wherein each language L in the set N(M) does not overlap with the language M;
obtaining n-gram data for a target document;
for each language M in the set of candidate languages, using the n-gram data to determine a final score SF(M) based on relative probabilities of a first hypothesis that the target document is entirely in the language M and a second hypothesis that one portion of the target document is in the language M while another portion of the target document is in a language L selected from the set N(M);
identifying, by a processor, one or more of the candidate languages as being languages of the document based on the final scores SF(M) for different languages M; and
storing, in a computer readable storage device, information indicating the one or more languages of the document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
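The scoring step of claim 1 compares a one-language hypothesis with a two-language hypothesis for each candidate M. The Python sketch below shows one way such a comparison could be set up. It is an illustration under stated assumptions, not the scoring formula of the patent: the hard assignment of each n-gram to the better-fitting model, the `FLOOR` probability, the fixed `PENALTY` for positing a second language, and the toy `train` helper are all choices made here for the example.

```python
import math
from collections import Counter

FLOOR = 1e-6    # assumed probability for n-grams a model has never seen
PENALTY = 5.0   # assumed log-domain cost of positing a second language


def ngram_counts(text, n=3):
    """Character n-gram counts for a document (the n-gram data)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train(text, n=3):
    """Toy language model: relative n-gram frequencies of a sample text."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def log_p(model, gram):
    return math.log(model.get(gram, FLOOR))


def final_scores(counts, models, non_overlapping):
    """Map each candidate M to (SF(M), partner language L or None)."""
    scores = {}
    for m, model_m in models.items():
        # First hypothesis: the whole document is in M.
        best = sum(c * log_p(model_m, g) for g, c in counts.items())
        partner = None
        # Second hypothesis: part is in M, the rest in some L from N(M).
        for l in non_overlapping.get(m, ()):
            h2 = sum(c * max(log_p(model_m, g), log_p(models[l], g))
                     for g, c in counts.items()) - PENALTY
            if h2 > best:
                best, partner = h2, l
        scores[m] = (best, partner)
    return scores


def identify(text, models, non_overlapping):
    """Pick the top-scoring M, plus its partner L if the second hypothesis won."""
    scores = final_scores(ngram_counts(text), models, non_overlapping)
    top = max(scores, key=lambda m: scores[m][0])
    partner = scores[top][1]
    return {top} if partner is None else {top, partner}


models = {"en": train("the cat sat on the mat the dog"),
          "ru": train("кот сидел на ковре и собака")}
n_of = {"en": {"ru"}, "ru": {"en"}}  # N(M) for each candidate M
print(identify("the cat sat", models, n_of))
print(identify("the cat кот сидел", models, n_of))
```

Without the penalty, the two-language hypothesis in this formulation could never score below the one-language hypothesis, so some cost for the extra language is needed before the comparison is meaningful. A real system would use far larger models and would likely choose that cost more carefully.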
12. A system for identifying one or more languages in a document, the system comprising:
a language model data store configured to store an n-gram based language model for each of a plurality of languages;
a document information data store configured to store information for each of a plurality of documents, the information including language identifying information indicating one or more languages associated with the document; and
a processor coupled to the language model data store and the document information data store, the processor being configured to execute language identification processes, the language identification processes including;
a first process that, when executed, generates n-gram data for a test document;
a second process that, when executed, generates a final score for the test document based on the n-gram data and the n-gram based language model for a candidate language M, wherein the final score is based on relative probabilities of a first hypothesis that the target document is entirely in the candidate language M and a second hypothesis that one portion of the target document is in the candidate language M while another portion of the target document is in an alternative language L wherein the alternative language L does not overlap with the candidate language M; and
a third process that, when executed, identifies one or more languages for the test document based on the final scores produced by the second process for each of a plurality of candidate languages.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
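The arrangement in claim 12, two data stores and three processes run by a processor, could be laid out roughly as in the following Python sketch. All class and parameter names are hypothetical. The scoring function is injected as a plain callable, so the sketch shows only how the pieces connect and does not attempt the two-hypothesis scoring itself; the `margin` rule for selecting languages is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

Model = Dict[str, float]  # n-gram -> probability


@dataclass
class LanguageModelStore:
    """Holds one n-gram based language model per language."""
    models: Dict[str, Model] = field(default_factory=dict)


@dataclass
class DocumentInfoStore:
    """Holds the identified languages for each document id."""
    languages: Dict[str, Set[str]] = field(default_factory=dict)


class LanguageIdentifier:
    def __init__(self, model_store: LanguageModelStore,
                 doc_store: DocumentInfoStore,
                 score_fn: Callable[[Counter, str, Dict[str, Model]], float],
                 margin: float = 0.0):
        self.model_store = model_store
        self.doc_store = doc_store
        self.score_fn = score_fn  # hypothetical stand-in for the final score
        self.margin = margin      # assumed tolerance around the best score

    def generate_ngram_data(self, text: str, n: int = 3) -> Counter:
        """First process: n-gram data for the document."""
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def final_scores(self, counts: Counter) -> Dict[str, float]:
        """Second process: one final score per candidate language M."""
        models = self.model_store.models
        return {m: self.score_fn(counts, m, models) for m in models}

    def identify(self, doc_id: str, text: str) -> Set[str]:
        """Third process: select languages and record them for the document."""
        scores = self.final_scores(self.generate_ngram_data(text))
        best = max(scores.values())
        langs = {m for m, s in scores.items() if s >= best - self.margin}
        self.doc_store.languages[doc_id] = langs
        return langs


def overlap_score(counts: Counter, lang: str, models: Dict[str, Model]) -> float:
    """Placeholder score: how many document n-grams the model knows."""
    return float(sum(c for g, c in counts.items() if g in models[lang]))


store = LanguageModelStore({"en": {"the": 0.5, "he ": 0.5},
                            "de": {"der": 0.5, "er ": 0.5}})
docs = DocumentInfoStore()
identifier = LanguageIdentifier(store, docs, overlap_score)
print(identifier.identify("doc-1", "the cat"))
```

Passing the scorer in as a callable keeps the stores and the selection step independent of any particular scoring method, which makes it straightforward to substitute a hypothesis-comparison score for the placeholder used here.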
20. A non-transitory computer readable medium on which is stored machine readable instructions that when executed by a processor implement a method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the machine readable instructions comprising code to:
for each language (M) in the set of candidate languages, define a set of non overlapping languages (N(M)), the set N(M) consisting of one or more languages (L), wherein each language L in the set N(M) does not overlap with the language M;
obtain n-gram data for a target document;
for each language M in the set of candidate languages, use the n-gram data to determine a final score SF(M) based on relative probabilities of a first hypothesis that the target document is entirely in the language M and a second hypothesis that one portion of the target document is in the language M while another portion of the target document is in a language L selected from the set N(M);
identify one or more of the candidate languages as being languages of the document based on the final scores SF(M) for different languages M; and
store information indicating the one or more languages of the document.
Dependent claims: 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
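The first step recited in the claim, defining a set N(M) of non-overlapping languages for each candidate M, might be sketched in Python as follows. The claim text here does not say how overlap is determined, so this example assumes that two languages do not overlap when their character sets are disjoint, in line with the abstract's mention of non-overlapping character sets. The function name and the small alphabets are illustrative only.

```python
from typing import Dict, Set


def non_overlapping_sets(alphabets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build N(M) for every language M.

    Assumption: L is in N(M) exactly when the character sets of L and M
    share no characters.
    """
    return {
        m: {l for l, chars_l in alphabets.items()
            if l != m and chars_m.isdisjoint(chars_l)}
        for m, chars_m in alphabets.items()
    }


# Illustrative alphabets only; a real system would use fuller character sets.
alphabets = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "fr": set("abcdefghijklmnopqrstuvwxyzéèàç"),
    "ru": set("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
    "el": set("αβγδεζηθικλμνξοπρστυφχψω"),
}
n_of = non_overlapping_sets(alphabets)
for lang in sorted(n_of):
    print(lang, sorted(n_of[lang]))
```

With these alphabets, English and French share Latin letters and so are excluded from each other's sets, while Russian and Greek are non-overlapping with both. Restricting the two-language hypothesis to such pairs keeps the number of language pairs to compare small.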
Specification