Language Identification in Multilingual Text
Abstract
Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. Each section is identified and assigned a weight: more informative sections receive a higher weight and less informative sections a lower weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and the document scores are ranked to determine the primary language for the document.
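The scoring flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `WORD_SCORES` table, its values, the section weights, and the choice to multiply each section's combined score by its weight before summing are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical per-word language likelihoods; the values are illustrative only.
WORD_SCORES = {
    "the": {"en": 0.9, "fr": 0.0},
    "language": {"en": 0.6, "fr": 0.1},
    "le": {"en": 0.0, "fr": 0.9},
    "langue": {"en": 0.0, "fr": 0.8},
    "menu": {"en": 0.5, "fr": 0.5},
}

def score_section(words):
    """Combine the per-word scores within one section, per language."""
    totals = defaultdict(float)
    for word in words:
        for lang, score in WORD_SCORES.get(word.lower(), {}).items():
            totals[lang] += score
    return totals

def rank_languages(sections):
    """sections: list of (weight, words). Returns [(lang, score)], best first."""
    document = defaultdict(float)
    for weight, words in sections:
        for lang, score in score_section(words).items():
            # Assumption: a section's weight scales its combined score.
            document[lang] += weight * score
    return sorted(document.items(), key=lambda item: item[1], reverse=True)

sections = [
    (2.0, ["Le", "langue"]),              # e.g. a title, weighted higher
    (0.5, ["the", "language", "menu"]),   # e.g. navigation text, weighted lower
]
ranking = rank_languages(sections)
primary = ranking[0][0]
```

With these toy inputs the short, heavily weighted French section outweighs the longer, lightly weighted English one, so `primary` is `"fr"`.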
20 Claims
1. A computer-implemented system for identifying multilingual text in a document using computer processor, memory, and data storage subsystems, the computer-implemented system comprising:
a code-page conversion component to identify the character encoding used by a document and to decode said encoding into a universal representative encoding via the processor;
a section breaking and classification component to divide plain-text content of the document into one or more weighted sections;
a language scoring component to obtain language likelihood scores of each word, phrase, or character n-gram in the one or more weighted sections, and to combine the obtained language likelihood scores according to language; and
an output language selection component to select a primary language for the document based upon a highest combined language likelihood score.
Dependent claims: 2, 3, 4, 5, 6, 7
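As an illustration of the code-page conversion component in claim 1, the following sketch decodes raw bytes into Unicode text by trying candidate encodings in order. The claim does not say how the encoding is identified; the trial-decoding strategy, the `CANDIDATE_ENCODINGS` list, and the `to_unicode` name are assumptions made for this example.

```python
# Hypothetical priority order; a real system might inspect document metadata first.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

def to_unicode(raw: bytes):
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value to a code point, so it never fails.
    return raw.decode("latin-1"), "latin-1"

text, detected = to_unicode(b"caf\xe9")
```

Here the bytes are not valid UTF-8, so the function falls through to `cp1252` and returns the text `café`.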
8. A computer-implemented method of identifying multilingual text in a document using a computing system having processor, memory, and data storage subsystems, the computer-implemented method comprising:
isolating one or more regions of plain-text content in a document;
disjoining the plain-text content into sections according to semantic and syntactic categories;
assigning a weight to each of the sections;
calculating a language likelihood score for each word, phrase, or character n-gram in each of the sections;
computing a combined language likelihood score for each of the sections for each language; and
outputting the highest ranked language from said computing as a primary language of the document.
Dependent claims: 9, 10, 11, 12, 13
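The character n-gram scoring step of claim 8 could look roughly like the sketch below. The claim does not specify a scoring model; the trigram size, the relative-frequency profiles, the summed log-likelihood, the `floor` value for unseen n-grams, and the tiny training sentences are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_profile(sample, n=3):
    """Build a relative-frequency table of n-grams from a sample text."""
    counts = Counter(char_ngrams(sample, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def log_likelihood(text, profile, n=3, floor=1e-6):
    """Sum log-probabilities of the text's n-grams; unseen n-grams get `floor`."""
    return sum(math.log(profile.get(gram, floor)) for gram in char_ngrams(text, n))

# Toy training data; real profiles would be built from large corpora.
profiles = {
    "en": train_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": train_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
scores = {lang: log_likelihood("the dog", profile) for lang, profile in profiles.items()}
best = max(scores, key=scores.get)
```

Most trigrams of "the dog" appear in the English profile and none in the German one, so `best` is `"en"`.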
14. One or more computer-readable storage media containing computer-readable instructions embodied thereon that, when executed by a computing device, perform a method of selecting a primary language of a multilingual document, the method comprising:
dividing plain-text content of a document into one or more weighted script sections;
determining a likelihood score for each word, phrase, or character n-gram belonging to one or more languages for each of the weighted script sections;
summing all of the likelihood scores from each word, phrase, or character n-gram in a section together for each individual language to obtain one or more section language summations;
combining all of the section language summations for each individual language to obtain a document score for each individual language;
ranking all of the document scores; and
selecting a primary document language from the highest document score.
Dependent claims: 15, 16, 17, 18, 19, 20
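To make the "weighted script sections" of claim 14 concrete, the sketch below splits text into runs of a single writing system and gives each run a weight. It is only an approximation: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name stands in for it, and the length-proportional weight is an assumption that the claim does not state.

```python
import unicodedata

def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # spaces, digits, and punctuation are treated as neutral
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_sections(text):
    """Split text into (script, substring, weight) runs.

    Neutral characters are attached to the preceding run; leading neutral
    characters are dropped. Weight is each run's share of the kept characters.
    """
    runs = []
    for ch in text:
        script = rough_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            runs[-1][1].append(ch)
        elif script is not None:
            runs.append((script, [ch]))
    cleaned = [(script, "".join(chars).strip()) for script, chars in runs]
    total = sum(len(chunk) for _, chunk in cleaned) or 1
    return [(script, chunk, len(chunk) / total) for script, chunk in cleaned]

sections = script_sections("Hello мир")
```

For the input above this yields a `LATIN` run ("Hello", weight 0.625) followed by a `CYRILLIC` run ("мир", weight 0.375).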
Specification