Trigram-based method of language identification

US 5,062,143 A
Filed: 02/23/1990
Issued: 10/29/1991
Est. Priority Date: 02/23/1990
Status: Expired due to Term

Abstract

A mechanism for examining a body of text and identifying its language compares successive trigrams into which the body of text is parsed with a library of sets of trigrams. For a respective language-specific key set of trigrams, if the ratio of the number of trigrams in the text, for which a match in the key set has been found, to the total number of trigrams in the text is at least equal to a prescribed value, then the text is identified as being possibly written in the language associated with that respective key set. Each respective trigram key set is associated with a respectively different language and contains those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language. Successive key sets for other languages are processed as above, and the language for which the percentage of matches is greatest, and for which the percentage exceeded the prescribed value as above, is selected as the language in which the body of text is written.
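The selection rule in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the toy key sets, and the threshold of 0.3 are all assumptions made for the example, since the text only speaks of "a prescribed value" and does not give one.

```python
from typing import Dict, List, Optional, Set


def trigrams(text: str) -> List[str]:
    """Return every run of three successive character/space positions."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def identify_language(
    text: str,
    key_sets: Dict[str, Set[str]],
    threshold: float = 0.3,  # assumed value; the source only says "prescribed"
) -> Optional[str]:
    """Pick the language whose key set matches the largest share of trigrams.

    Returns None when the text is too short, when no ratio reaches the
    threshold, or when the best ratio is tied with another language.
    """
    grams = trigrams(text)
    if not grams or not key_sets:
        return None

    # Ratio of matched trigrams to all trigrams in the text, per language.
    ratios = {
        lang: sum(g in keys for g in grams) / len(grams)
        for lang, keys in key_sets.items()
    }

    best = max(ratios, key=ratios.get)
    beats_others = all(
        ratios[best] > r for lang, r in ratios.items() if lang != best
    )
    if ratios[best] >= threshold and beats_others:
        return best
    return None


# Toy key sets for illustration only; real key sets would be derived
# from large samples of each language.
TOY_KEY_SETS = {
    "english": {"the", "he ", " th", "and", "ing"},
    "german": {"der", "die", "ein", "ch ", "sch"},
}

result = identify_language("the thing and the other", TOY_KEY_SETS)
```

Returning `None` on a tie is one reading of the requirement that the winning ratio be greater than the ratios for the alternative languages; a different tie-breaking policy would also be plausible.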

248 Citations

6 Claims

1. A method of determining in what language a body of text is written comprising the steps of:
- (a) parsing said body of text into a plurality of trigrams so that at least some of the trigrams overlap adjacent words, each trigram comprising the contents of three successive character/space positions of said body of text;
  
  (b) comparing each of the trigrams that has been parsed from said body of text in step (a) with a plurality of trigram key sets, each respective trigram key set being associated with a respectively different language and containing those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language; and
  
  (c) in response to the ratio of the number of trigrams of said body of text compared in step (b), that correspond to trigrams of a respective key set, to the total number of trigrams of said body of text being at least equal to a prescribed value and greater than such ratios for alternative languages, identifying the body of text as being written in the language associated with said respective key set.
- Dependent claims (2, 3):
  - 2. A method according to claim 1, wherein the prescribed frequency of occurrence in step (b) is established in accordance with a measured probability of occurrence of every trigram capable of occurring in that language.
  - 3. A method according to claim 1, wherein a respective one of the plurality of trigram key sets employed in step (b) is generated by the steps of:
    - (i) parsing a section of text of a prescribed language into a plurality of trigrams, each of which is comprised of the contents of three successive character/space positions of said section of text;
      
      (ii) counting the number of occurrences of each of the trigrams that has been parsed from said section of text in step (i);
      
      (iii) determining the ratio of each of the number of occurrences of the trigrams counted in step (ii) with the total number of trigrams into which said section of text has been parsed in step (i), and deriving therefrom a characteristic representative of the frequency of trigram occurrence of each trigram that may be formed using the characters of said prescribed language and a space position;
      
      (iv) from the characteristic derived in step (iii), identifying the frequency of occurrence of trigrams for said prescribed language that is associated with a selected frequency of occurrence; and
      
      (v) generating, as said key set of trigrams, those trigrams whose frequency of occurrence is at least equal to the frequency of occurrence identified in step (iv).

4. A method of determining in what language a body of text is written, said body of text containing N sequential character/space positions, comprising the steps of:
- (a) parsing said body of text into each of (N-2) trigrams that are sequentially definable by said N sequential character/space positions so that at least some of the trigrams overlap adjacent words;
  
  (b) comparing each of the trigrams parsed in step (a) with a plurality of trigram key sets, each respective trigram key set being associated with a respectively different language and containing those trigrams that have been predetermined to occur at a frequency that is at least equal to a prescribed frequency of occurrence of trigrams for that respective language; and
  
  (c) in response to the ratio of the number of trigrams of said body of text compared in step (b), that correspond to trigrams of a respective key set, to the total number of trigrams of said body of text, being at least equal to a prescribed value and exceeding such ratios for alternative languages, identifying the body of text as being written in the language associated with said respective set.
- Dependent claim (5):
  - 5. A method according to claim 4, wherein the prescribed frequency of occurrence in step (b) is established in accordance with a measured probability of occurrence of every trigram capable of occurring in that language.
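The (N-2) count in claim 4 follows from sliding a three-position window one position at a time across the text, spaces included. The short Python sketch below illustrates this; the function name `parse_trigrams` is hypothetical and chosen only for the example.

```python
from typing import List


def parse_trigrams(text: str) -> List[str]:
    """Slide a 3-position window over the text, one position at a time.

    Spaces count as positions, so windows can straddle a word boundary.
    A text with N positions yields N - 2 trigrams (none if N < 3).
    """
    return [text[i:i + 3] for i in range(len(text) - 2)]


sample = "to be"                 # N = 5 character/space positions
grams = parse_trigrams(sample)   # 5 - 2 = 3 trigrams
print(grams)                     # ['to ', 'o b', ' be']
```

The middle trigram, `'o b'`, takes one letter from each of the two adjacent words, which is the sense in which some trigrams overlap adjacent words.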

6. A method of generating a key set of trigrams to be used in determining in what language text has been written, based upon a comparison of trigrams, into which said text is to be parsed, with said set of trigrams comprising the steps of:
- (a) parsing a body of text of a prescribed language into a plurality of trigrams so that at least some of the trigrams overlap adjacent words, each trigram comprising the contents of three successive character/space positions of said body of text;
  
  (b) counting the number of occurrences of each of the trigrams that has been parsed from said body of text in step (a);
  
  (c) determining the ratio of each of the number of occurrences of the trigrams counted in step (b) with the total number of trigrams into which said body of text has been parsed in step (a), and deriving therefrom a characteristic representative of the frequency of trigram occurrence of each trigram that may be formed using the characters of said prescribed language and a space position;
  
  (d) from the characteristic derived in step (c), identifying the frequency of occurrence of trigrams for said prescribed language that is associated with a selected frequency of occurrence; and
  
  (e) generating, as said key set of trigrams, those trigrams whose frequency of occurrence is at least equal to the frequency of occurrence identified in step (d).
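Steps (a) through (e) of claim 6 can be sketched as follows. This is an illustration under stated assumptions rather than the patented procedure: the function name `build_key_set` is hypothetical, and the cutoff is passed in directly as `min_frequency` because the claim does not say how the "selected frequency of occurrence" in step (d) is chosen.

```python
from collections import Counter
from typing import Set


def build_key_set(sample: str, min_frequency: float) -> Set[str]:
    """Build a trigram key set from a sample of one language.

    min_frequency is an assumed parameter standing in for the cutoff
    identified in step (d); the source does not specify its value.
    """
    # (a) parse the sample into overlapping trigrams, spaces included
    grams = [sample[i:i + 3] for i in range(len(sample) - 2)]
    if not grams:
        return set()

    # (b) count the occurrences of each trigram
    counts = Counter(grams)

    # (c) relative frequency = occurrences / total trigrams parsed;
    # trigrams never seen in the sample implicitly have frequency 0
    total = len(grams)

    # (d)-(e) keep the trigrams whose frequency reaches the cutoff
    return {g for g, c in counts.items() if c / total >= min_frequency}


# Tiny synthetic sample: 'abc', 'bca', 'cab' each occur 2 times out of 7,
# 'abd' occurs once, so a cutoff of 0.2 drops only 'abd'.
key_set = build_key_set("abcabcabd", min_frequency=0.2)
```

In practice the sample would be a large body of text in the prescribed language, and the cutoff would be tuned so that each key set is small enough to search quickly yet still distinctive.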

Current Assignee: Harris Corporation (L3Harris Technologies, Inc.)
Original Assignee: Harris Corporation (L3Harris Technologies, Inc.)
Inventors: Schmitt, John C.
Primary Examiner(s): Boudreau, Leo H.
Assistant Examiner(s): Fallon, Steven P.
Application Number: US07/485,115
Time in Patent Office: 613 Days
Field of Search: 382/36, 382/40, 382/9, 382/37, 382/38, 382/39
US Class Current: 382/230
CPC Class Codes:
G06F 40/263 Language identification
G06F 40/289 Phrasal analysis, e.g. fini...
