Language identification process using coded language words
First Claim
1. A machine process for identifying a human language used in a computer coded document from text in the document, comprising the steps of reading a sequence of words from the document, comparing each word obtained by the reading step to words in a plurality of Word Frequency Tables (WFTs) respectively associated with languages of interest, each WFT containing a set of most frequently used words in an associated language, and each word in a WFT having an associated numerical value representing a previously determined frequency of occurrence (FO) value for the word in a sample of documents written in the associated language, associating a Word Frequency Accumulator (WFA) with each WFT, and resetting each WFA to a predetermined WFA value prior to reading each document by the reading step, outputting the FO value associated with each word matched by the comparing step with a word read by the reading step, inputting each FO value provided by the outputting step to the associated WFA, adding each FO value to a current sum contained in the associated WFA to generate an accumulated amount, detecting which of the plural WFAs has the largest accumulated amount, and identifying the human language associated with the WFA detected to have the largest accumulated value.
Abstract
Provides a process which identifies the language or genre of a stored or transmitted document. The process uses a plurality of Word Frequency Tables (WFTs) respectively associated with languages/genres of interest. Each WFT contains relatively few of the most common words of one of the languages of interest. Each word code in a WFT has an associated normalized frequency of occurrence value (NFO); use of NFOs increases the language/genre detection ability of the process. A plurality of respective accumulators are associated with the plurality of WFTs. All accumulators are set to zero before identification processing starts. The language/genre identification process receives a sequence of words from an inputted document, and compares each received word to all of the words in all WFTs. Whenever a received word is found in any WFT, the process adds the word's associated NFO to a current total in the associated accumulator. In this manner, totals in all accumulators build up into language discriminating values after a number of words are read from the document. Processing stops either when the end of the document is reached or when a predetermined number of words have been received; the language/genre associated with the accumulator containing the largest total is then the identified language.
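The accumulation procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The table contents, the NFO values, the lowercasing of input words, and the names `WFTS`, `accumulate`, and `identify_language` are all assumptions made for illustration and do not come from the patent.

```python
from typing import Dict, Iterable, Optional

# Hypothetical Word Frequency Tables: the words and NFO values below are
# invented for illustration and are not taken from the patent.
WFTS: Dict[str, Dict[str, float]] = {
    "english": {"the": 0.069, "of": 0.036, "and": 0.029, "to": 0.026},
    "french": {"de": 0.060, "la": 0.035, "le": 0.030, "et": 0.025},
    "german": {"der": 0.040, "die": 0.038, "und": 0.030, "in": 0.020},
}


def accumulate(
    words: Iterable[str],
    wfts: Dict[str, Dict[str, float]],
    max_words: Optional[int] = None,
) -> Dict[str, float]:
    """Return one accumulator total per table after reading the words."""
    # One accumulator per WFT, reset to zero before the document is read.
    accumulators = {language: 0.0 for language in wfts}
    for count, word in enumerate(words):
        # Stop once the predetermined number of words has been read.
        if max_words is not None and count >= max_words:
            break
        key = word.lower()  # assumption: case-insensitive matching
        # Compare the word against every table; add its NFO on a match.
        for language, table in wfts.items():
            nfo = table.get(key)
            if nfo is not None:
                accumulators[language] += nfo
    return accumulators


def identify_language(
    words: Iterable[str],
    wfts: Dict[str, Dict[str, float]],
    max_words: Optional[int] = None,
) -> str:
    """Return the language whose accumulator holds the largest total."""
    totals = accumulate(words, wfts, max_words)
    # On a tie (including all zeros), max() returns the first table listed.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    sample = "the cat and the dog went to the park".split()
    print(identify_language(sample, WFTS))  # prints "english"
```

In this sketch a word that appears in more than one table contributes to every matching accumulator, which follows the abstract's statement that each word is compared to all words in all WFTs. How ties are resolved is not stated in the text above, so the first-listed behavior here is only a convenience of the sketch.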
274 Citations
9 Claims
1. A machine process for identifying a human language used in a computer coded document from text in the document, comprising the steps of
reading a sequence of words from the document,
comparing each word obtained by the reading step to words in a plurality of Word Frequency Tables (WFTs) respectively associated with languages of interest, each WFT containing a set of most frequently used words in an associated language, and each word in a WFT having an associated numerical value representing a previously determined frequency of occurrence (FO) value for the word in a sample of documents written in the associated language,
associating a Word Frequency Accumulator (WFA) with each WFT, and resetting each WFA to a predetermined WFA value prior to reading each document by the reading step,
outputting the FO value associated with each word matched by the comparing step with a word read by the reading step,
inputting each FO value provided by the outputting step to the associated WFA,
adding each FO value to a current sum contained in the associated WFA to generate an accumulated amount,
detecting which of the plural WFAs has the largest accumulated amount, and
identifying the human language associated with the WFA detected to have the largest accumulated value.
Specification