Language Identification in Multilingual Text

US 20120095748A1
Filed: 10/14/2010
Published: 04/19/2012
Est. Priority Date: 10/14/2010
Status: Active Grant

Abstract

Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. The sections are identified and assigned a weight, wherein more informative sections are given a higher weight and less informative sections are given a lesser weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed together to obtain a total document score for each language. This results in a document score for each language, which can be ranked to determine the primary language for the document.
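
The weighting and summation described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TOKEN_SCORES` table, the section weights, and the function names are all hypothetical, and a real system would draw likelihoods from trained word, phrase, or character n-gram models.

```python
from collections import defaultdict

# Hypothetical per-token language likelihoods (assumed values for illustration).
TOKEN_SCORES = {
    "the": {"en": 0.9, "fr": 0.0},
    "language": {"en": 0.6, "fr": 0.1},
    "le": {"en": 0.05, "fr": 0.8},
    "langue": {"en": 0.0, "fr": 0.9},
}

def score_section(text):
    """Combine token likelihoods within one section, per language."""
    totals = defaultdict(float)
    for token in text.lower().split():
        for lang, score in TOKEN_SCORES.get(token, {}).items():
            totals[lang] += score
    return totals

def rank_document(sections):
    """sections: list of (weight, text). Returns (language, score) pairs, best first."""
    doc = defaultdict(float)
    for weight, text in sections:
        for lang, score in score_section(text).items():
            doc[lang] += weight * score  # more informative sections count more
    return sorted(doc.items(), key=lambda kv: kv[1], reverse=True)

sections = [
    (3.0, "the language"),  # e.g. a title, given a higher weight
    (1.0, "le langue le"),  # e.g. a footer, given a lower weight
]
ranking = rank_document(sections)
primary = ranking[0][0]  # highest-ranked language is taken as the primary language
```

Summing weighted scores keeps the contribution of each section proportional to its assigned importance, so a short but heavily weighted title can outweigh a longer, lightly weighted footer.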

20 Claims

1. A computer-implemented system for identifying multilingual text in a document using computer processor, memory, and data storage subsystems, the computer-implemented system comprising:
- a code-page conversion component to identify the character encoding used by a document and to decode said encoding into a universal representative encoding via the processor;
- a section breaking and classification component to divide plain-text content of the document into one or more weighted sections;
- a language scoring component to obtain language likelihood scores of each word, phrase, or character n-gram in the one or more weighted sections, and to combine the obtained language likelihood scores according to language; and
- an output language selection component to select a primary language for the document based upon a highest combined language likelihood score.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The computer-implemented system of claim 1, wherein the universal representative coding comprises Unicode.
  - 3. The computer-implemented system of claim 1, wherein the one or more weighted sections comprise an importance rating of a section relative to the document.
  - 4. The computer-implemented system of claim 1, wherein the plain-text content is parsed into sections based upon HTML tags, visual layout, structure, and semantic content of the document.
  - 5. The computer-implemented system of claim 1, wherein the language likelihood score is a function of one or more of language popularity, country of document origination, encoding used in the document, and document URL.
  - 6. The computer-implemented system of claim 1, wherein the language likelihood score comprises a likelihood of each word, phrase, or character n-gram belonging to one or more languages.
  - 7. The computer-implemented system of claim 1, wherein the output language selection component ranks results of the combined language likelihood scores for each language.
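
The code-page conversion step of claims 1 and 2 might look roughly like the sketch below. The claims do not say how the encoding is identified, so trial decoding over a fixed candidate list is an assumption made here for illustration; the function name `to_unicode` and the candidate list are hypothetical. A production system would more likely combine declared metadata with a statistical detector.

```python
def to_unicode(raw, candidates=("utf-8", "shift_jis", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Trial decoding is an assumed stand-in for the unspecified
    encoding-identification step.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return raw.decode("latin-1"), "latin-1"

# Bytes in a legacy Japanese code page are converted to a Unicode string.
text, encoding = to_unicode("日本語".encode("shift_jis"))
```

Ordering the candidates from strictest to most permissive matters: UTF-8 rejects most byte sequences from other code pages, while single-byte code pages accept almost anything.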

8. A computer-implemented method of identifying multilingual text in a document using a computing system having processor, memory, and data storage subsystems, the computer-implemented method comprising:
- isolating one or more regions of plain-text content in a document;
- disjoining the plain-text content into sections according to semantic and syntactic categories;
- assigning a weight to each of the sections;
- calculating a language likelihood score for each word, phrase, or character n-gram in each of the sections;
- computing a combined language likelihood score for each of the sections for each language; and
- outputting the highest ranked language from said computing as a primary language of the document.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The computer-implemented method of claim 8, further comprising:
    - identifying an encoding used with the document; and
    - decoding into a universal representative code.
  - 10. The computer-implemented method of claim 8, wherein said calculating comprises:
    - calculating the language likelihood scores for each word, phrase, or character n-gram within a section multiplied by a weight of the associated section.
  - 11. The computer-implemented method of claim 10, wherein said computing further comprises:
    - computing a sum of the language likelihood scores of the document for each language.
  - 12. The computer-implemented method of claim 8, further comprising:
    - dividing each of the sections of plain-text content into segments according to a writing script used.
  - 13. The computer-implemented method of claim 12, wherein the assigning comprises:
    - assigning a weight to each of the segments.
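
Claim 12's division of a section into segments by writing script could be approximated as below. Python's standard library exposes no Unicode script property, so this sketch assumes the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) is a good enough script label; that heuristic and the helper names are illustrative assumptions, not part of the claims.

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Rough script label: first word of the Unicode character name (a heuristic)."""
    if not ch.isalpha():
        return None  # spaces, digits, and punctuation carry no script of their own
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def split_by_script(text):
    """Split text into (script, segment) runs; neutral characters join the current run."""
    labelled, current = [], None
    for ch in text:
        current = script_of(ch) or current
        labelled.append((current, ch))
    return [
        (script, "".join(c for _, c in run).strip())
        for script, run in groupby(labelled, key=lambda pair: pair[0])
    ]

segments = split_by_script("Hello мир")
```

Each resulting segment could then be given its own weight, as in claim 13, before language likelihood scores are calculated.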

14. One or more computer-readable storage media containing computer-readable instructions embodied thereon that, when executed by a computing device, perform a method of selecting a primary language of a multilingual document, the method comprising:
- dividing plain-text content of a document into one or more weighted script sections;
- determining a likelihood score for each word, phrase, or character n-gram belonging to one or more languages for each of the weighted script sections;
- summing all of the likelihood scores from each word, phrase, or character n-gram in a section together for each individual language to obtain one or more section language summations;
- combining all of the section language summations for each individual language to obtain a document score for each individual language;
- ranking all of the document scores; and
- selecting a primary document language from the highest document score.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The media of claim 14, further comprising:
    - converting an encoding of the document into a universal representative coding.
  - 16. The media of claim 14, wherein the dividing is implemented using HTML tags.
  - 17. The media of claim 14, wherein the likelihood scores for each word, phrase, or character n-gram is obtained from a dictionary via a word-breaker.
  - 18. The media of claim 14, further comprising:
    - dividing each of the one or more weighted script sections into one or more weighted language sections.
  - 19. The media of claim 14, further comprising:
    - selecting additional languages that cover alternative alphabets or scripts that are not already covered by the primary language output.
  - 20. The media of claim 14, wherein the one or more weighted script sections are based upon importance of each section and a popularity of each language.
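
One possible reading of claim 19, in which additional languages are selected for scripts the primary language does not cover, is sketched below. The `SCRIPT_OF` table, the example scores, and the rule of taking the top-scoring language per uncovered script are assumptions for illustration; the claims do not specify the selection rule.

```python
# Hypothetical language-to-script table (assumed for illustration).
SCRIPT_OF = {
    "en": "Latin",
    "fr": "Latin",
    "ru": "Cyrillic",
    "uk": "Cyrillic",
}

def select_languages(doc_scores):
    """Return (primary, additional): the top language, then the best
    language for each script the earlier picks do not already cover."""
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    primary = ranked[0]
    covered = {SCRIPT_OF[primary]}
    additional = []
    for lang in ranked[1:]:
        if SCRIPT_OF[lang] not in covered:
            additional.append(lang)
            covered.add(SCRIPT_OF[lang])
    return primary, additional

# Example document scores per language (made-up numbers).
primary, additional = select_languages({"en": 4.6, "fr": 2.8, "ru": 1.9, "uk": 0.7})
```

Here French is skipped because English already covers the Latin script, while Russian is added as the strongest candidate for the otherwise uncovered Cyrillic text.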

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: LI, KANG; KLODER, STEPHEN ALLEN; JOHNSON, IAN GEORGE; ALONICHAU, SIARHEI
Granted Patent: US 8,635,061 B2
US Class Current: 704/8
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
G06F 40/263 Language identification
