Automatic orthographic transformation of a text stream
Abstract
A method is given for automatically rewriting orthography of a stream of text words, for example, automatically and properly capitalizing words in the stream. If a word in the stream has an entry in an orthography rewrite lexicon, the word is automatically replaced with an orthographically rewritten form of the word from the orthography rewrite lexicon. In addition, selected words in the stream are compared to a plurality of features weighted by a maximum entropy-based algorithm, to automatically determine whether to rewrite orthography of any of the selected words.
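As an illustration only, the two stages described above (a lexicon lookup, then a decision based on weighted features) might be sketched as follows for the capitalization case. The lexicon entries, feature names, weights, and the 0.5 threshold are all hypothetical values chosen for the example and do not come from the patent. For a two-way decision, a maximum entropy model reduces to a logistic function of the summed weights, which is what the sketch uses.

```python
import math

# Hypothetical orthography rewrite lexicon: input word -> rewritten form.
REWRITE_LEXICON = {"boston": "Boston", "ibm": "IBM", "monday": "Monday", "dr.": "Dr."}

# Hypothetical feature weights, standing in for the output of maximum entropy training.
WEIGHTS = {"bias": -1.0, "prev_is_sentence_end": 2.0, "prev_is_title": 1.5}
TITLES = {"mr.", "mrs.", "dr."}


def active_features(words, i):
    """Return the names of the binary features that fire for words[i]."""
    feats = ["bias"]
    prev = words[i - 1] if i > 0 else None
    if prev is not None and prev.lower() in TITLES:
        feats.append("prev_is_title")
    elif prev is None or prev[-1] in ".?!":
        feats.append("prev_is_sentence_end")
    return feats


def capitalize_probability(words, i):
    """Two-class maximum entropy model: logistic function of the summed weights."""
    score = sum(WEIGHTS[f] for f in active_features(words, i))
    return 1.0 / (1.0 + math.exp(-score))


def rewrite_stream(words, threshold=0.5):
    out = []
    for i, word in enumerate(words):
        if word in REWRITE_LEXICON:
            # Stage 1: the word has a lexicon entry, so replace it outright.
            out.append(REWRITE_LEXICON[word])
        elif capitalize_probability(words, i) > threshold:
            # Stage 2: the weighted features favor rewriting this word.
            out.append(word.capitalize())
        else:
            out.append(word)
    return out


text = "the meeting with ibm is on monday . call dr. smith in boston"
print(" ".join(rewrite_stream(text.split())))
```

In this sketch the features are computed on the original, unrewritten stream, which is a design choice made for simplicity rather than something the abstract specifies.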
8 Claims
1. A method of automatically rewriting orthography of a stream of text words comprising:
if a word in the stream has an entry in an orthography rewrite lexicon, automatically replacing the word with an orthographically rewritten form of the word from the orthography rewrite lexicon;
selecting words in the stream; and
comparing the selected words to a plurality of features weighted by a maximum entropy-based algorithm, to automatically determine whether to rewrite orthography of any of the selected words.
2. A method according to claim 1, further comprising:
if a series of adjacent words in the stream has an entry in a phrase rewrite lexicon, replacing the series of adjacent words with a phrase form of the series of words from the phrase rewrite lexicon.
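A phrase rewrite lexicon of the kind recited for a series of adjacent words could be sketched as below. The lexicon contents are hypothetical, and the longest-match-first scan is an assumption of this example; the claim does not say how overlapping phrase entries are resolved.

```python
# Hypothetical phrase rewrite lexicon: tuple of adjacent words -> phrase form.
PHRASE_LEXICON = {
    ("new", "york"): "New York",
    ("new", "york", "city"): "New York City",
    ("united", "states"): "United States",
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASE_LEXICON)


def rewrite_phrases(words):
    """Replace runs of adjacent words that have a phrase lexicon entry."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first (an assumption of this sketch).
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            key = tuple(words[i:i + n])
            if key in PHRASE_LEXICON:
                out.extend(PHRASE_LEXICON[key].split())
                i += n
                break
        else:
            # No phrase starts here; copy the word through unchanged.
            out.append(words[i])
            i += 1
    return out


print(rewrite_phrases("flights from new york city to the united states".split()))
```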
3. A method according to claim 1, wherein automatically replacing the word includes associating annotating linguistic tags with the orthographically rewritten form of the word.
4. A method according to claim 1, further comprising:
providing linguistic tags to selected words in the stream;
using context-sensitive rewrite rules to change the orthography of words in the stream based on their linguistic tags; and
weighting the application of these rules in specific contexts according to maximum entropy weighting.
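One way to picture the three steps of claim 4 is the sketch below: words arrive with linguistic tags, each rewrite rule is conditioned on the tag of the target word and of the preceding word, and a per-rule weight decides whether the rule is applied in that context. The tag names, the rules, the weights, and the threshold are all hypothetical, and gating each rule with a logistic function of its weight is only one possible reading of "maximum entropy weighting".

```python
import math
from typing import Callable, NamedTuple, Optional

# Hypothetical abbreviation table used by one of the rules below.
UNIT_ABBREVIATIONS = {"kilometers": "km", "kilograms": "kg"}


class Rule(NamedTuple):
    tag: str                       # tag the target word must carry
    prev_tag: Optional[str]        # required tag of the preceding word; None means any
    rewrite: Callable[[str], str]  # orthographic change to apply
    weight: float                  # hypothetical weight from maximum entropy training


RULES = [
    Rule("PROPER_NOUN", None, str.capitalize, 2.0),
    Rule("UNIT", "NUMBER", lambda w: UNIT_ABBREVIATIONS.get(w, w), 1.2),
    Rule("NOUN", "DETERMINER", str.capitalize, -1.5),
]


def apply_rules(tagged, threshold=0.5):
    """tagged is a list of (word, tag) pairs; returns the rewritten words."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        prev_tag = tagged[i - 1][1] if i > 0 else None
        for rule in RULES:
            if rule.tag == tag and rule.prev_tag in (None, prev_tag):
                # Convert the rule's weight into a probability of applying it.
                p_apply = 1.0 / (1.0 + math.exp(-rule.weight))
                if p_apply > threshold:
                    word = rule.rewrite(word)
                break  # only the first matching rule is considered
        out.append(word)
    return out


tagged = [("the", "DETERMINER"), ("train", "NOUN"), ("to", "PREP"),
          ("paris", "PROPER_NOUN"), ("covers", "VERB"),
          ("450", "NUMBER"), ("kilometers", "UNIT")]
print(" ".join(apply_rules(tagged)))
```

Here the negatively weighted rule for a noun after a determiner matches "train" but is not applied, which shows how a learned weight can suppress a rule in a specific context.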
5. A method according to claim 1, wherein at least one of the features is a context-dependent probability distribution representing a likelihood of a given word in a given context being in a given orthographic form.
6. A method according to claim 5, further comprising:
for each selected word, determining an orthographic rewrite probability representing a normalized product of the weighted features for that word, and if the orthographic rewrite probability is greater than a selected threshold probability, replacing that selected word with an orthographically rewritten form.
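The "normalized product of the weighted features" can be read in the usual maximum entropy form: for each candidate orthographic form, multiply the weights of the features that are active for that form in the current context, then divide by the sum of those products over all candidate forms. The sketch below follows that reading. The candidate forms, context properties, weight values, and threshold are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical multiplicative weights, keyed by (orthographic form, context property).
ALPHAS = {
    ("unchanged", "default"): 2.0,
    ("capitalized", "sentence_start"): 6.0,
    ("capitalized", "after_title"): 4.0,
    ("upper", "all_consonants"): 5.0,
}
FORMS = {"unchanged": lambda w: w, "capitalized": str.capitalize, "upper": str.upper}


def context_properties(words, i):
    """Return the set of context properties that hold for words[i]."""
    props = {"default"}
    if i == 0 or words[i - 1] in {".", "?", "!"}:
        props.add("sentence_start")
    if i > 0 and words[i - 1].lower() in {"mr.", "dr."}:
        props.add("after_title")
    if words[i].isalpha() and not set(words[i].lower()) & set("aeiouy"):
        props.add("all_consonants")
    return props


def form_probabilities(words, i):
    """Normalized product of the weights of the active features, per form."""
    props = context_properties(words, i)
    products = {}
    for form in FORMS:
        product = 1.0
        for (feature_form, prop), alpha in ALPHAS.items():
            if feature_form == form and prop in props:
                product *= alpha
        products[form] = product
    z = sum(products.values())  # normalizing constant
    return {form: p / z for form, p in products.items()}


def rewrite_word(words, i, threshold=0.5):
    probs = form_probabilities(words, i)
    best = max(("capitalized", "upper"), key=probs.get)
    # Rewrite only when the rewrite probability exceeds the threshold.
    if probs[best] > threshold:
        return FORMS[best](words[i])
    return words[i]


words = ["dr.", "smith", "joined", "nbc", "."]
print([rewrite_word(words, i) for i in range(len(words))])
```

For "smith" after "dr.", the products are 2 (unchanged), 4 (capitalized), and 1 (upper), so the capitalized form gets probability 4/7 and clears the 0.5 threshold.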
7. A method according to claim 1, wherein the method automatically capitalizes words in the stream of text words.
8. A method according to claim 1, wherein the method automatically abbreviates words in the stream of text words.
Specification