Language identification process using coded language words

US 5,548,507 A
Filed: 03/14/1994
Issued: 08/20/1996
Est. Priority Date: 03/14/1994
Status: Expired due to Fees

Abstract

Provides a process which identifies the language or genre of a stored or transmitted document. The process uses a plurality of Word Frequency Tables (WFTs) respectively associated with languages/genre of interest. Each WFT contains a relatively few of the most common words of one of the languages of interest. Each word code in a WFT has an associated normalized frequency of occurrence value (NFO); use of NFOs increases the language/genre detection ability of the process. A plurality of respective accumulators are associated with the plurality of WFTs. All accumulators are set to zero before identification processing starts. The language/genre identification process receives a sequence of words from an inputted document, and compares each received word to all of the words in all WFTs. Whenever a received word is found in any WFT, the process adds the word's associated NFO to a current total in the associated accumulator. In this manner, totals in all accumulators build up into language discriminating values after a number of words are read from the document. Processing stops when either the end of the document is reached or when a predetermined number of words are received; and then the language/genre associated with the accumulator containing the largest total is the identified language.
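The accumulate-and-compare loop described in the abstract can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: the table contents, the NFO values, the whitespace tokenization, the punctuation stripping, and the names `WFTS` and `identify_language` are all assumptions made for the example.

```python
# Hypothetical toy Word Frequency Tables. The words and NFO values are
# invented for illustration; they are not taken from the patent.
WFTS = {
    "english": {"the": 1.0, "of": 0.6, "and": 0.5, "in": 0.4},
    "german": {"der": 1.0, "die": 0.9, "und": 0.8, "in": 0.5},
    "french": {"de": 1.0, "la": 0.7, "le": 0.6, "et": 0.5},
}


def identify_language(text, wfts, max_words=None):
    """Return (language with the largest accumulator, all accumulators)."""
    # One accumulator per WFT, all set to zero before processing starts.
    accumulators = {lang: 0.0 for lang in wfts}

    # Assumed tokenization: lowercase, split on whitespace, strip punctuation.
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    if max_words is not None:
        # Stop after a predetermined number of words.
        words = words[:max_words]

    for word in words:
        # Compare each received word against every table.
        for lang, table in wfts.items():
            nfo = table.get(word)
            if nfo is not None:
                # Add the word's NFO to that table's accumulator.
                accumulators[lang] += nfo

    # The language whose accumulator holds the largest total wins.
    # If nothing matched, max() simply returns the first language.
    best = max(accumulators, key=accumulators.get)
    return best, accumulators
```

A call such as `identify_language("the cat sat in the garden and slept", WFTS)` adds the values for "the", "in" and "and" to the English accumulator, while "in" also adds to the German one, because a word may appear in more than one table.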

274 Citations

9 Claims

1. A machine process for identifying a human language used in a computer coded document from text in the document, comprising the steps of
   reading a sequence of words from the document,
   comparing each word obtained by the reading step to words in a plurality of Word Frequency Tables (WFTs) respectively associated with languages of interest, each WFT containing a set of most frequently used words in an associated language, and each word in a WFT having an associated numerical value representing a previously determined frequency of occurrence (FO) value for the word in a sample of documents written in the associated language,
   associating a Word frequency Accumulator (WFA) with each WFT, and resetting each WFA to a predetermined WFA value prior to reading each document by the reading step,
   outputting the FO value associated with each word matched by the comparing step with a word read by the reading step,
   inputting each FO value provided by the outputting step to the associated WFA,
   adding each FO value to a current sum contained in the associated WFA to generate an accumulated amount,
   detecting which of the plural WFAs has the largest accumulated amount, and
   identifying the human language associated with the WFA detected to have the largest accumulated value.
  - 2. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 1, further comprising the steps of
      finding the largest FO value of all words in each WFT,
      normalizing the FO value for each WFT by dividing each FO value by the largest FO value found in the WFT to generate a normalized frequency of occurrence (NFO) for the word, and
      replacing each FO value with the NFO value determined by the normalizing step.
  - 3. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 2, further comprising the steps of
      inserting one or more special words in any WFT which are not in the other WFTs, each non-special word in each WFT being also found in at least one other WFT, and
      storing an FO value for each special word larger than the FO value of any non-special word in the WFT.
  - 4. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 3, further comprising the steps of
      comparing each word in WFT with words in each of the other WFTs to find each special word and each non-special word in each WFT, and
      assigning a larger FO value to each special word found by the comparing step than the FO value provided for any non-special word in the WFT.
  - 5. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 2, further comprising the steps of
      generating a word frequency table (WFT) by reading a plurality of sampled documents known to be in a language of interest for which the WFT is to be generated,
      counting number of occurrences for each word read in the sampled documents by the generating step to establish a FO value associated with each word in the WFT, and
      retaining in the WFT language the words having associated FO values exceeding a threshold, and the WFTs each having approximately the same total value for all FOs in each WFT.
  - 6. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 5, further comprising the steps of
      setting the threshold for a minimum number of words which must be read from a document before a language identification can be made,
      counting the words read from the document, and
      making a language identification only if the count exceeds the threshold.
  - 7. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 6, further comprising the steps of
      setting the threshold for a largest WFA value which can identify a language, and
      identifying the language of the document when the largest WFA value exceeds the next-largest WFA value by more than an established threshold and the word count exceeds the threshold.
  - 8. A machine process for identifying a human language used in a computer coded document from text in the document, as defined in claim 5, further comprising the steps of
      generating any WFT to represent a genre within a language instead of, or in addition to, representing the language by a WFT, and
      establishing the FO values associated with words in the WFT from word frequencies in a sampling of documents representing the genre.
  - 9. A machine process for identifying a human language used in a computer coded document from text in the document as defined in claim 1, further comprising the steps of
      establishing an established range of word-lengths defined as the count of letters in each word as an initial step,
      determining a word-length for each word by counting the letters in each word in each WFT,
      comparing the word-length for each word with the established range, and
      removing from the WFT any word and its associated FO value when word-length is not within the established range.
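
Several of the dependent claims describe table preparation and decision rules that are easier to follow in code. The Python sketch below is one possible reading under stated assumptions, not the patent's implementation: it covers counting and threshold retention (claim 5), word-length filtering (claim 9), normalization by the largest FO (claim 2), and the minimum word count and winning-margin checks (claims 6 and 7). The function names `build_wft` and `decide`, all parameter names and default values, and the whitespace tokenization are hypothetical. The special-word handling of claims 3 and 4 and the equal-total condition of claim 5 are left out.

```python
from collections import Counter


def build_wft(sample_docs, fo_threshold=1, min_len=1, max_len=6):
    """Build a normalized WFT from sample documents in one language."""
    # Claim 5: count occurrences of each word across the sample documents.
    counts = Counter()
    for doc in sample_docs:
        counts.update(doc.lower().split())  # assumed tokenization

    # Claim 5: retain only words whose FO exceeds the threshold.
    # Claim 9: drop words whose letter count is outside the allowed range.
    kept = {
        word: fo
        for word, fo in counts.items()
        if fo > fo_threshold and min_len <= len(word) <= max_len
    }
    if not kept:
        return {}

    # Claim 2: divide every FO by the largest FO in the table to get the NFO.
    largest = max(kept.values())
    return {word: fo / largest for word, fo in kept.items()}


def decide(accumulators, words_read, min_words=5, min_margin=0.5):
    """Return the identified language, or None if the evidence is too weak."""
    # Claim 6: identify only if enough words have been read.
    if not accumulators or words_read <= min_words:
        return None

    ranked = sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)
    best_lang, best_total = ranked[0]
    # With a single table there is no runner-up; treat it as zero (assumption).
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0

    # Claim 7: the winner must lead the next-largest total by more than a margin.
    if best_total - runner_up > min_margin:
        return best_lang
    return None
```

For example, building a table from the two sample strings "the cat and the dog" and "the bird and the fish in the sea" with `fo_threshold=1` keeps only "the" (5 occurrences) and "and" (2 occurrences), which normalize to 1.0 and 0.4.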

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Paulsen, Robert C. Jr.; Martino, Michael J.
Primary Examiner(s): Weinhardt, Robert A.
Application Number: US08/212,490
Time in Patent Office: 890 Days
Field of Search: 364/419.01, 364/419.02, 364/419.08, 364/419.10, 364/419.11
US Class Current: 704/1
CPC Class Codes: G06F 40/216 (using statistical methods); G06F 40/263 (Language identification)
