Determining a natural language shift in a computer document
Abstract
Language shift points in a computer document written in a plurality of natural languages are determined. An interval is defined on and moved through a text document in a computer memory; the interval contains a portion of the text in the document. As the interval is moved through the document, a probability that the text in the interval is written in each of a plurality of candidate languages is determined for each position of the interval. For the first position of the interval, generally the beginning of the document, the candidate language having the highest probability of all the candidate languages within the interval is classified as the current language. A language shift point in the document is identified where the relative probability of a second candidate language is higher than that of the current language at a new position of the interval. At this point, the second candidate language is classified as the current language in the document after the language shift point. The process continues to identify other language shift points in the document.
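The procedure summarized in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the word lists, the interval size, the tokenizer, and the names `WORD_LISTS`, `score_interval`, and `find_shift_points` are all hypothetical, and the "probability" is approximated as each language's share of the word-list matches inside the interval.

```python
import re

# Hypothetical word lists of a few common words per candidate language.
# Real lists would be selected from frequency data for each language.
WORD_LISTS = {
    "english": {"the", "of", "and", "to", "is", "in", "a", "that"},
    "french": {"le", "la", "de", "et", "les", "des", "est", "un"},
}


def score_interval(words, word_lists=WORD_LISTS):
    """Return each language's share of the word-list matches in the interval."""
    counts = {
        lang: sum(1 for w in words if w in common)
        for lang, common in word_lists.items()
    }
    total = sum(counts.values())
    if total == 0:
        # No word in the interval matched any list.
        return {lang: 0.0 for lang in counts}
    return {lang: n / total for lang, n in counts.items()}


def find_shift_points(text, interval_size=4, word_lists=WORD_LISTS):
    """Slide an interval one word at a time and report language changes.

    Returns (initial_language, shifts), where shifts is a list of
    (interval_start_index, new_language) pairs. The index is the start of
    the first interval in which the new language scores higher, so it only
    approximates where the language actually changes.
    """
    # Assumed tokenizer: runs of letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    current = None
    shifts = []
    last_start = max(len(words) - interval_size, 0)
    for start in range(last_start + 1):
        scores = score_interval(words[start:start + interval_size], word_lists)
        best = max(scores, key=scores.get)
        if scores[best] == 0.0:
            continue  # nothing recognized at this position
        if current is None:
            current = initial = best  # first position with evidence
        elif best != current and scores[best] > scores[current]:
            shifts.append((start, best))
            current = best
    return (initial if current is not None else None), shifts
```

Requiring the second language to score strictly higher than the current one means a tie leaves the current language unchanged, which is one simple way to avoid flickering at the boundary; other tie-breaking rules are equally plausible.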
26 Claims
1. A method for detecting language shift points in a computer document written in a plurality of natural languages, comprising the steps of:
moving an interval through a text document in a computer memory, the interval containing a plurality of words in the document;
for each position of the interval, determining a probability that text in the interval is written in each of a plurality of candidate languages according to a respective number of matches of words in the interval with words in each of a plurality of word lists of a few common words selected from each respective candidate language;
for a first position of the interval, classifying a first candidate language having the highest probability as the current language within the interval;
finding a language shift point in the document where the probability that a second candidate language is higher than the current language for a new position of the interval; and
classifying the second candidate language as the current language in the document after the language shift point.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system including processor and memory for detecting language shift points in a computer document written in a plurality of natural languages, comprising:
interval defining means for moving an interval through a computer document in the memory, the interval containing a plurality of words in the document;
language comparing means for determining for each position of the interval a probability that text in the interval is written in each of a plurality of candidate languages according to a respective number of matches of words in the interval with words in each of a plurality of word lists of a few common words selected from each respective candidate language;
language determining means for determining for each position of the interval the candidate language having the highest relative probability within the interval;
language shift determining means for finding language shift points in the document according to the positions of the interval where the candidate language having the highest relative probability changes from prior positions.
Dependent claims: 11, 12, 13, 14, 15, 16, 17
18. A computer program product in a computer readable medium for detecting language shift points in a computer document written in a plurality of natural languages, comprising:
means for moving an interval through a computer document in a computer memory, the interval containing a plurality of words in the document;
means for determining a probability that text in the interval is written in each of a plurality of candidate languages within the interval according to a respective number of matches of words in the interval with words in each of a plurality of word lists of a few common words selected from each respective candidate language; and
means for finding language shift points in the document according to changes in the relative probabilities of the candidate language as the interval changes positions in the document.
Dependent claims: 19, 20, 21, 22, 23
24. A method for detecting language shift points in a computer document written in a plurality of natural languages, comprising the steps of:
selecting successive pluralities of words located at successive locations in the document;
for each plurality of words, recognizing no words or one or more words, but not all words, as being members of each of a respective candidate language;
for each plurality of words, classifying the plurality of words as being written in a respective language according to a greatest number of words in the respective plurality of words being recognized as members of the respective language; and
finding language shift points in the document according to where the respective languages of successive pluralities of words change according to the classifying step.
Dependent claims: 25
26. A method for detecting language shift points in a computer document written in a plurality of natural languages, comprising the steps of:
selecting successive pluralities of words located at successive locations in the document, wherein each plurality of words includes a first number of words of which a second number of words is shared by a next plurality of words;
for each plurality of words, classifying the plurality of words as being written in a respective language according to a greatest number of words in the respective plurality of words being recognized as members of the respective language; and
finding language shift points in the document according to where the respective languages of successive pluralities of words change according to the classifying step.
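The overlapping selection recited in claim 26 can be read as a fixed-size window advanced by the window size minus the number of shared words. The following Python sketch shows one possible reading, with classification by the greatest count of recognized words. The names `COMMON_WORDS`, `windows`, `classify`, and `shift_points`, the word lists, and the tie-breaking rule are assumptions made for illustration; they do not come from the patent text.

```python
# Hypothetical word lists; a real system would use lists chosen per language.
COMMON_WORDS = {
    "english": {"the", "of", "and", "to", "is", "in", "a", "that"},
    "french": {"le", "la", "de", "et", "les", "des", "est", "un"},
}


def windows(words, size, shared):
    """Yield (start, window) pairs; each window shares `shared` words with the next."""
    if not 0 <= shared < size:
        raise ValueError("shared must satisfy 0 <= shared < size")
    step = size - shared
    last_start = max(len(words) - size, 0)
    # Trailing words that do not fill a whole step are left uncovered here.
    for start in range(0, last_start + 1, step):
        yield start, words[start:start + size]


def classify(window, word_lists=COMMON_WORDS):
    """Return the language with the most recognized words, or None if none match."""
    counts = {
        lang: sum(1 for w in window if w in common)
        for lang, common in word_lists.items()
    }
    # Ties go to whichever language comes first in the dictionary (an assumption).
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def shift_points(words, size, shared, word_lists=COMMON_WORDS):
    """Return (window_start, old_language, new_language) for each change."""
    shifts = []
    previous = None
    for start, window in windows(words, size, shared):
        lang = classify(window, word_lists)
        if lang is None:
            continue  # no recognized words; keep the previous classification
        if previous is not None and lang != previous:
            shifts.append((start, previous, lang))
        previous = lang
    return shifts
```

A larger overlap gives finer resolution on where the change occurs at the cost of classifying more windows; setting `shared` to zero yields disjoint windows.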
Specification