Language identification for documents containing multiple languages

US 8,938,384 B2
Filed: 07/16/2012
Issued: 01/20/2015
Est. Priority Date: 11/19/2008
Status: Expired due to Fees

First Claim

1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:

dividing the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;

segmenting the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (A_t);

for each segment t, generating a segment score (S_t(L)) for each language (L) in the active one of the disjoint subsets A_t;

identifying, by a processor, one or more languages as being languages of the document based on the segment scores S_t(L) for all of the segments t and languages L; and

storing, in a computer readable storage device, information indicating the one or more languages of the document.
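
The dividing step can be read as a connected-components problem: languages that overlap end up in the same subset, so any two languages in different subsets are guaranteed not to overlap. The following is a minimal Python sketch under the assumption that two languages overlap when their alphabets share at least one character. The toy alphabets and the function name `disjoint_subsets` are hypothetical and are not taken from the patent.

```python
from itertools import combinations

def disjoint_subsets(alphabets):
    """Group languages so that languages in different groups share no characters."""
    parent = {lang: lang for lang in alphabets}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of languages whose alphabets intersect (assumed overlap test).
    for a, b in combinations(alphabets, 2):
        if alphabets[a] & alphabets[b]:
            parent[find(a)] = find(b)

    groups = {}
    for lang in alphabets:
        groups.setdefault(find(lang), set()).add(lang)
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))

# Hypothetical toy alphabets, for illustration only.
TOY_ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "fr": set("abcdefghijklmnopqrstuvwxyzéèàç"),
    "ru": set("абвгдежзийклмнопрстуфхцчшщъыьэюя"),
    "el": set("αβγδεζηθικλμνξοπρστυφχψω"),
}
subsets = disjoint_subsets(TOY_ALPHABETS)
```

With these toy alphabets, English and French fall into one subset while Russian and Greek each form their own.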

Abstract

Multiple nonoverlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under the hypothesis that the whole document is in one language and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified based on comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores.
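
The per-segment scoring in the second embodiment can be sketched as follows. This is an illustration only: it assumes bigram models stored as dictionaries, a floor probability `DEFAULT_PROB` for unseen bigrams, and a mean log-probability score. None of these choices is stated in the text above, and the toy models are hypothetical.

```python
import math

DEFAULT_PROB = 1e-6  # assumed floor for bigrams a model has never seen

def bigrams(text):
    """Return the overlapping character bigrams of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def segment_score(segment, model):
    """Mean log-probability of the segment's bigrams under one language model."""
    grams = bigrams(segment)
    if not grams:
        return float("-inf")
    total = sum(math.log(model.get(g, DEFAULT_PROB)) for g in grams)
    return total / len(grams)

def score_segment(segment, active_subset, models):
    """Score a segment only against the languages in its active subset."""
    return {lang: segment_score(segment, models[lang]) for lang in active_subset}

# Hypothetical toy models mapping bigram -> probability.
MODELS = {
    "en": {"th": 0.03, "he": 0.03, "e ": 0.02, " c": 0.01, "ca": 0.01, "at": 0.02},
    "fr": {"le": 0.03, "e ": 0.03, " c": 0.01, "ch": 0.02, "ha": 0.01, "at": 0.01},
}
scores = score_segment("the cat", {"en", "fr"}, MODELS)
best = max(scores, key=scores.get)
```

Restricting the scoring to the active subset means a Latin-script segment is never compared against, say, a Cyrillic model, which keeps the number of model evaluations small.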

20 Claims

1. A method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the method comprising:
- dividing the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
  
  segmenting the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (A_t);
  
  for each segment t, generating a segment score (S_t(L)) for each language (L) in the active one of the disjoint subsets A_t;
  
  identifying, by a processor, one or more languages as being languages of the document based on the segment scores S_t(L) for all of the segments t and languages L; and
  
  storing, in a computer readable storage device, information indicating the one or more languages of the document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1 wherein two languages do not have overlap with each other in the event that the two languages have no characters in common with each other.
  - 3. The method of claim 1 wherein two languages do not have overlap with each other in the event that the two languages have no bigrams in common with each other.
  - 4. The method of claim 1 wherein the n-grams are bigrams.
  - 5. The method of claim 4 wherein generating the segment score S_t(L) for each segment t includes computing, for each language L that is in the active one of the disjoint subsets A_t:
  - 6. The method of claim 1 wherein segmenting the document includes:
    - reading a first n-gram from the document;
      
      identifying one of the disjoint subsets as a first active subset for the document, wherein the first n-gram has greater than the default probability of occurrence for at least one of the languages in the first active subset;
      
      continuing to read successive n-grams from the document until a transition n-gram is encountered, wherein the transition n-gram does not have greater than the default probability of occurrence for any of the languages in the first active subset and that does have greater than the default probability of occurrence for at least one of the languages in a second active subset;
      
      identifying as a first segment the portion of the document from the first n-gram to the transition n-gram; and
      
      identifying as a second segment a portion of the document that begins with the transition n-gram.
  - 7. The method of claim 1 wherein identifying one or more languages as being languages of the document includes:
    - for each segment t, identifying the language L that has the best segment score S_t(L) as a language of the document.
  - 8. The method of claim 1 wherein identifying one or more languages as being languages of the document includes:
    - for each segment t, identifying the language L_t that has the best segment score S_t(L);
      
      determining whether the segment score S_t(L_t) for the language L_t satisfies a threshold criterion; and
      
      identifying the language L_t as a language of the document in the event that the segment score S_t(L_t) for the language L_t satisfies the threshold criterion.
  - 9. The method of claim 1 wherein identifying one or more languages as being languages of the document includes:
    - determining whether each segment t is a long segment or a short segment;
      
      for each segment t that is a long segment;
      
      identifying the language L that has the best segment score S_t(L) as a language of the document; and
      
      for each segment t that is a short segment;
      
      determining whether any one or more other short segments k have the same active subset (A_t) as the segment t;
      
      for each other short segment k that has the same active subset A_t as the segment t, combining the other segment score S_k(L) with the segment score S_t(L) for each language L in active subset A_t to determine an aggregate score for language L; and
      
      identifying the language L_t that has the best aggregate score as a language of the document.
  - 10. The method of claim 1 wherein identifying one or more languages as being languages of the document includes:
    - determining whether each segment t is a long segment or a short segment;
      
      for each segment t that is a long segment;
      
      identifying the language L_t that has the best segment score S_t(L);
      
      determining whether the segment score S_t(L_t) for the language L_t satisfies a threshold criterion; and
      
      identifying the language L_t as a language of the document in the event that the segment score S_t(L_t) for the language L_t satisfies the threshold criterion; and
      
      for each segment t that is a short segment;
      
      determining whether any one or more other short segments k have the same active subset (A_t) as the segment t;
      
      for each other short segment k that has the same active subset A_t as the segment t, aggregating the other segment score S_k(L) with the segment score S_t(L) for each language L in active subset A_t to determine an aggregate score for language L;
      
      identifying the language L_t that has the best aggregate score;
      
      determining whether the aggregate score satisfies a threshold criterion; and
      
      identifying the language L_t as a language of the document in the event that the aggregate score for the language L_t satisfies the threshold criterion.
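
The long/short handling in claims 9 and 10 can be sketched in Python as follows. Several details are assumptions made for illustration, since the claims do not fix them: segment scores are total log-likelihoods, "best" means highest, scores are combined by summation, a segment counts as long at `long_min` characters, and the threshold criterion is a minimum per-character average. The function name, parameters, and sample values are hypothetical.

```python
from collections import defaultdict

def identify_languages(segments, long_min=20, threshold=-6.0):
    """Pick document languages from scored segments.

    segments: list of dicts with keys
      "length" - number of characters in the segment (assumed > 0)
      "subset" - frozenset of languages in the segment's active subset
      "scores" - {language: total log-likelihood of the segment}
    """
    found = set()
    pooled = defaultdict(lambda: {"length": 0, "scores": defaultdict(float)})

    for seg in segments:
        if seg["length"] >= long_min:
            # Long segment: judged on its own score.
            best = max(seg["scores"], key=seg["scores"].get)
            if seg["scores"][best] / seg["length"] >= threshold:
                found.add(best)
        else:
            # Short segment: pooled with other short segments sharing its active subset.
            pool = pooled[seg["subset"]]
            pool["length"] += seg["length"]
            for lang, score in seg["scores"].items():
                pool["scores"][lang] += score

    for pool in pooled.values():
        best = max(pool["scores"], key=pool["scores"].get)
        if pool["scores"][best] / pool["length"] >= threshold:
            found.add(best)
    return found

# Hypothetical scored segments.
latin = frozenset({"en", "fr"})
cyrillic = frozenset({"ru", "bg"})
segments = [
    {"length": 40, "subset": latin, "scores": {"en": -120.0, "fr": -200.0}},
    {"length": 5, "subset": cyrillic, "scores": {"ru": -20.0, "bg": -22.0}},
    {"length": 6, "subset": cyrillic, "scores": {"ru": -21.0, "bg": -30.0}},
    {"length": 4, "subset": frozenset({"el"}), "scores": {"el": -40.0}},
]
languages = identify_languages(segments)
```

In this example the long Latin segment yields English on its own, the two short Cyrillic segments only yield Russian once pooled, and the isolated Greek fragment is rejected by the threshold.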

11. A system for identifying one or more languages in a document, the system comprising:
- a language model data store configured to store an n-gram based language model for each of a plurality of languages, wherein the plurality of languages belong to a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
  
  a document information data store configured to store information for each of a plurality of documents, the information including language identifying information indicating one or more languages associated with the document; and
  
  a processor coupled to the language model data store and the document information data store, the processor being configured to execute language identification processes, the language identification processes including;
  
  a first process that, when executed, segments a test document into one or more segments of consecutive characters, wherein each segment contains n-grams that have greater than a default probability of occurrence only for languages in a same one of the plurality of disjoint subsets, and further generates a set of segment scores for the test document, wherein the set of segment scores includes a score for each one of the segments scored against each one of the language models in the one of the plurality of disjoint subsets applicable to that segment; and
  
  a second process that, when executed, identifies one or more of the plurality of languages as being languages of the documents based on the set of segment scores.
- Dependent claims (12, 13, 14, 15, 16, 17)
  - 12. The system of claim 11 wherein the second process, when executed, further stores information indicating the identified one or more languages for the test document in the document information data store.
  - 13. The system of claim 11 wherein the n-grams are bigrams.
  - 14. The system of claim 11 wherein the processor is further configured to execute a first process that, when executed, defines the plurality of disjoint subsets based on the n-gram based language models stored in the language model data store.
  - 15. The system of claim 11 wherein two languages do not overlap with each other in the event that the two or more languages have no characters in common.
  - 16. The system of claim 11 wherein two languages do not overlap with each other in the event that the respective n-gram based language models for the two languages have no n-grams in common.
  - 17. The system of claim 11 wherein the first process, when executed, segments the test document based at least in part on detecting a transition within a sequence of n-grams of the document from a current n-gram that has greater than a default probability of occurrence in at least one language in a current one of the plurality of disjoint subsets to a next n-gram that does not have greater than a default probability of occurrence in at least one language in the current one of the plurality of disjoint subsets.
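
The transition-based segmentation of claim 17 (and of method claim 6) can be sketched in Python as follows. The sketch assumes character bigrams, dictionary-based models in which any bigram absent from a model has the default probability, and toy models built from two short sample strings. The names `known_to`, `segment_document`, and `toy_model` are hypothetical.

```python
DEFAULT_PROB = 1e-6  # assumed default probability for unseen bigrams

def known_to(gram, subset, models):
    """True if any language in subset rates gram above the default probability."""
    return any(models[lang].get(gram, DEFAULT_PROB) > DEFAULT_PROB for lang in subset)

def segment_document(text, subsets, models):
    """Split text into (segment_text, active_subset) pairs at subset transitions."""
    segments, start, active = [], 0, None
    for i in range(len(text) - 1):
        gram = text[i:i + 2]
        if active is not None and known_to(gram, active, models):
            continue  # still inside the active subset
        nxt = next((s for s in subsets if s != active and known_to(gram, s, models)), None)
        if nxt is None:
            continue  # neutral bigram (no subset claims it): stay in the current segment
        if active is not None:
            # Transition bigram found: close the segment, start a new one at the bigram.
            segments.append((text[start:i], active))
            start = i
        active = nxt
    if active is not None:
        segments.append((text[start:], active))
    return segments

def toy_model(sample):
    """Hypothetical model: every bigram seen in sample gets probability 0.01."""
    return {sample[i:i + 2]: 0.01 for i in range(len(sample) - 1)}

MODELS = {"en": toy_model("hello world"), "ru": toy_model("привет мир")}
LATIN, CYRILLIC = frozenset({"en"}), frozenset({"ru"})
result = segment_document("hello привет", [LATIN, CYRILLIC], MODELS)
```

Here the bigram spanning the space and the first Cyrillic letter is unknown to both models, so it is treated as neutral and the split happens at the first bigram the Cyrillic subset recognizes.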

18. A non-transitory computer readable medium on which is stored machine readable instructions that when executed by a processor implement a method of identifying one or more languages for a document, the languages being selected from a set of candidate languages, the machine readable instructions comprising code to:
- divide the set of candidate languages into a plurality of disjoint subsets, wherein any two languages that are in different disjoint subsets do not overlap with each other;
  
  segment the document into one or more segments (t) of consecutive characters, wherein each segment t contains n-grams that have greater than a default probability of occurrence only for languages in an active one of the disjoint subsets (A_t);
  
  for each segment t, generate a segment score (S_t(L)) for each language (L) in the active one of the disjoint subsets A_t;
  
  identify one or more languages as being languages of the document based on the segment scores S_t(L) for all of the segments t and languages L; and
  
  store, in a computer readable storage device, information indicating the one or more languages of the document.
- Dependent claims (19, 20)
  - 19. The non-transitory computer readable medium of claim 18, wherein two languages do not have overlap with each other in the event that the two languages have no characters in common with each other.
  - 20. The non-transitory computer readable medium of claim 18, wherein two languages do not have overlap with each other in the event that the two languages have no bigrams in common with each other.

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Stratify, Inc. (Open Text Corporation)
Inventors: Goswami, Sauraj
Primary Examiner(s): Han, Qi
Application Number: US13/550,346
Publication Number: US 20130191111A1
Time in Patent Office: 918 Days
Field of Search: None
US Class Current: 704/8
CPC Class Codes:
- G06F 40/263 Language identification
- G06F 40/58 Use of machine translation,...
