Language identification for documents containing multiple languages
First Claim
1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:
dividing the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
segmenting the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (At);
for each segment t, generating a segment score (St(L)) for each language (L) in the active one of the disjoint subsets At;
identifying, by a processor, one or more languages as being languages of the document based on the segment scores St(L) for all of the segments t and languages L; and
storing, in a computer readable storage device, information indicating the one or more languages of the document.
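The steps of this claim can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the two disjoint subsets (`LATIN`, `CYRILLIC`), the toy bigram tables in `MODELS`, the `DEFAULT_PROB` value, and the rule of reporting the top-scoring language per active subset are all hypothetical choices made for the example. Script membership is approximated from Unicode character names.

```python
import math
import unicodedata
from collections import defaultdict

# Hypothetical disjoint subsets: languages in different subsets share no script.
SUBSETS = {"LATIN": ["en", "fr"], "CYRILLIC": ["ru", "bg"]}

# Toy bigram probabilities (made-up values, for illustration only).
MODELS = {
    "en": {"th": 0.02, "he": 0.02, "ca": 0.02, "at": 0.02},
    "fr": {"le": 0.02, "ou": 0.02, "es": 0.02},
    "ru": {"пр": 0.02, "ри": 0.02, "ет": 0.02},
    "bg": {"ъл": 0.02, "ът": 0.02},
}
DEFAULT_PROB = 1e-6  # assumed default probability for unseen n-grams


def script_of(ch):
    """Return the subset key for a letter, or None for neutral characters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    for script in SUBSETS:
        if name.startswith(script):
            return script
    return None


def segment(text):
    """Split text into (active_subset, substring) runs of consecutive characters."""
    segments = []
    active, start = None, 0
    for i, ch in enumerate(text):
        s = script_of(ch)
        if s is None or s == active:
            continue  # neutral characters stay in the current segment
        if active is not None:
            segments.append((active, text[start:i]))
            start = i
        active = s
    if active is not None:
        segments.append((active, text[start:]))
    return segments


def segment_score(seg, lang):
    """Log-likelihood of the segment's bigrams under one language model."""
    s = seg.lower()
    model = MODELS[lang]
    return sum(math.log(model.get(s[i:i + 2], DEFAULT_PROB))
               for i in range(len(s) - 1))


def identify(text):
    """Score each segment only against its active subset, then pick per subset."""
    totals = defaultdict(float)
    active_subsets = []
    for script, seg in segment(text):
        if script not in active_subsets:
            active_subsets.append(script)
        for lang in SUBSETS[script]:
            totals[lang] += segment_score(seg, lang)
    return sorted(max(SUBSETS[s], key=lambda lang: totals[lang])
                  for s in active_subsets)


if __name__ == "__main__":
    print(segment("the cat привет"))
    print(identify("the cat привет"))
```

Because each segment is scored only against the languages of its active subset, a Cyrillic run never lowers the score of a Latin-script language, and vice versa.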
Abstract
Multiple non-overlapping languages within a single document can be identified. In one embodiment, a set of non-overlapping languages is defined for each language in a set of candidate languages. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. The language or languages of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. The language or languages of the document are then identified based on the segment scores.
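The first embodiment can be illustrated with a short Python sketch. The abstract does not say how the competing hypotheses are scored, so everything here is an assumption: the toy bigram tables in `MODELS`, the `NON_OVERLAPPING` map, scoring a two-language hypothesis by taking the better of the two models for each bigram, and the `margin` penalty that a two-language hypothesis must overcome before it replaces the best single-language one.

```python
import math
from itertools import combinations

DEFAULT_PROB = 1e-6  # assumed default probability for unseen bigrams

# Toy bigram probabilities (made-up values, for illustration only).
MODELS = {
    "en": {"th": 0.02, "he": 0.02, "ca": 0.02, "at": 0.02},
    "fr": {"le": 0.02, "ou": 0.02, "es": 0.02},
    "ru": {"пр": 0.02, "ри": 0.02, "ет": 0.02},
}

# For each candidate language, the languages assumed not to overlap with it.
NON_OVERLAPPING = {"en": {"ru"}, "fr": {"ru"}, "ru": {"en", "fr"}}


def bigrams(text):
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)]


def logp(gram, lang):
    return math.log(MODELS[lang].get(gram, DEFAULT_PROB))


def single_score(text, lang):
    """Hypothesis: the whole document is in one language."""
    return sum(logp(g, lang) for g in bigrams(text))


def pair_score(text, a, b):
    """Hypothesis: part is in language a, the rest in non-overlapping b."""
    return sum(max(logp(g, a), logp(g, b)) for g in bigrams(text))


def identify(text, margin=5.0):
    """Compare the best single-language hypothesis with each language pair."""
    best_lang = max(MODELS, key=lambda lang: single_score(text, lang))
    best_langs, best_score = [best_lang], single_score(text, best_lang)
    for a, b in combinations(sorted(MODELS), 2):
        if b not in NON_OVERLAPPING[a]:
            continue  # only non-overlapping pairs are considered
        score = pair_score(text, a, b) - margin
        if score > best_score:
            best_langs, best_score = [a, b], score
    return best_langs


if __name__ == "__main__":
    print(identify("the cat привет"))
    print(identify("the cat"))
```

The penalty is needed in this sketch because a two-language score computed this way can never be lower than either single-language score; without it, a pair would win every tie.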
20 Claims
1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:
dividing the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
segmenting the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (At);
for each segment t, generating a segment score (St(L)) for each language (L) in the active one of the disjoint subsets At;
identifying, by a processor, one or more languages as being languages of the document based on the segment scores St(L) for all of the segments t and languages L; and
storing, in a computer readable storage device, information indicating the one or more languages of the document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for identifying one or more languages in a document, the system comprising:
a language model data store configured to store an n-gram based language model for each of a plurality of languages, wherein the plurality of languages belong to a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
a document information data store configured to store information for each of a plurality of documents, the information including language identifying information indicating one or more languages associated with the document; and
a processor coupled to the language model data store and the document information data store, the processor being configured to execute language identification processes, the language identification processes including:
a first process that, when executed, segments a test document into one or more segments of consecutive characters, wherein each segment contains n-grams that have greater than a default probability of occurrence only for languages in a same one of the plurality of disjoint subsets, and further generates a set of segment scores for the test document, wherein the set of segment scores includes a score for each one of the segments scored against each one of the language models in the one of the plurality of disjoint subsets applicable to that segment; and
a second process that, when executed, identifies one or more of the plurality of languages as being languages of the documents based on the set of segment scores.
Dependent claims: 12, 13, 14, 15, 16, 17
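The division of work between the second process and the document information data store might look like the following Python sketch. It is illustrative only: the `DocumentInfoStore` class, the `second_process` function, the shape of the segment-score input, and the rule of keeping the highest-total language in each subset are hypothetical and are not taken from the claim. The segment scores are assumed to have been produced already by the first process.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentInfoStore:
    """Hypothetical in-memory stand-in for the document information data store."""
    languages: dict = field(default_factory=dict)  # document id -> list of languages


def second_process(doc_id, segment_scores, info_store):
    """Identify languages from per-segment scores and record them in the store.

    segment_scores is assumed to be a list with one entry per segment:
    (subset_name, {language: score}), where only languages in that segment's
    subset have been scored.
    """
    totals = {}  # subset name -> {language: summed score}
    for subset, scores in segment_scores:
        per_subset = totals.setdefault(subset, {})
        for lang, value in scores.items():
            per_subset[lang] = per_subset.get(lang, 0.0) + value
    # Assumed decision rule: keep the highest-total language of each subset.
    identified = sorted(max(langs, key=langs.get) for langs in totals.values())
    info_store.languages[doc_id] = identified
    return identified


if __name__ == "__main__":
    store = DocumentInfoStore()
    scores = [
        ("latin", {"en": -40.0, "fr": -60.0}),
        ("cyrillic", {"ru": -30.0, "bg": -55.0}),
        ("latin", {"en": -20.0, "fr": -25.0}),
    ]
    print(second_process("doc-1", scores, store))
    print(store.languages)
```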
18. A non-transitory computer readable medium on which is stored machine readable instructions that when executed by a processor implement a method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the machine readable instructions comprising code to:
divide the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
segment the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (At);
for each segment t, generate a segment score (St(L)) for each language (L) in the active one of the disjoint subsets At;
identify one or more languages as being languages of the document based on the segment scores St(L) for all of the segments t and languages L; and
store, in a computer readable storage device, information indicating the one or more languages of the document.
Dependent claims: 19, 20
Specification