Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction
First Claim
Patent Images
1. A method comprising:
- receiving an entered string;
determining how likely a word w may be incorrectly entered as a string s based on partitioning the word w and the string s;
computing probabilities for various partitionings to determine a highest likelihood of at least one edit operation that converts a first character sequence of arbitrary length in the word w to a second character sequence of arbitrary length in the string s;
implementing edit operations consisting of insertion, deletion, substitution, matching, transposition, doubling, and halving;
implementing an edit to be conditioned on a probability of a position that the edit occurs, P(α
→
β
|PSN), wherein edit operations are characterized as α
→
β
, where α
is one character sequence of zero or more characters, β
is another character sequence of zero or more characters, PSN describes positional information about a substring within the word, including the position may be a start of a word, an end of a word, or some other location within the word (PSN ={start of word, end of word, other});
wherein edits operations are not constrained or limited to a specified set of changes;
adding a start-of-word symbol and an end-of-word symbol to each word to provide the positional information; and
identifying misspelled words, wherein the misspelled words may be potentially corrected to an appropriate spelling.
1 Assignment
0 Petitions
Accused Products
Abstract
A spell checker based on the noisy channel model has a source model and an error model. The source model determines how likely a word w in a dictionary is to have been generated. The error model determines how likely the word w was to have been incorrectly entered as the string s (e.g., mistyped or incorrectly interpreted by a speech recognition system) according to the probabilities of string-to-string edits. The string-to-string edits allow conversion of one arbitrary length character sequence to another arbitrary length character sequence.
-
Citations
15 Claims
-
1. A method comprising:
-
receiving an entered string; determining how likely a word w may be incorrectly entered as a string s based on partitioning the word w and the string s; computing probabilities for various partitionings to determine a highest likelihood of at least one edit operation that converts a first character sequence of arbitrary length in the word w to a second character sequence of arbitrary length in the string s; implementing edit operations consisting of insertion, deletion, substitution, matching, transposition, doubling, and halving; implementing an edit to be conditioned on a probability of a position that the edit occurs, P(α
→
β
|PSN), wherein edit operations are characterized as α
→
β
, where α
is one character sequence of zero or more characters, β
is another character sequence of zero or more characters, PSN describes positional information about a substring within the word, including the position may be a start of a word, an end of a word, or some other location within the word (PSN ={start of word, end of word, other});wherein edits operations are not constrained or limited to a specified set of changes; adding a start-of-word symbol and an end-of-word symbol to each word to provide the positional information; and identifying misspelled words, wherein the misspelled words may be potentially corrected to an appropriate spelling. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A program embodied on a computer readable storage medium, which when executed, directs a computing device to perform spell checking, comprising:
-
a source model component of the program to determine how likely a word w in a dictionary may be generated; and an error model component of the program to determine how likely the word w may be incorrectly entered as a string s based on arbitrary length string-to-string transformations; wherein the error model component partitions the word w and the string s and computes probabilities for various partitionings based on string-to-string edits; wherein the error model component performs edit operations consisting of insertion, deletion, substitution, matching, transposition, doubling, and halving; wherein the error model component adds a start-of-word symbol and an end-of-word symbol to each word to provide a positional information; and wherein the program identifies misspelled words, and potentially corrects misspelled words. - View Dependent Claims (8, 9, 10)
-
-
11. A method for training an error model, implemented at least in part by a computing device, the method comprising:
-
providing a training set that includes correct dictionary words along with associated error words observed when entering words; producing probabilities of different arbitrary-length string-to-string corrections over a large set of training words and the associated errors when entering words; deriving probabilities on how likely a correct word may be changed to an incorrect word based on the training set; wherein the probabilities are based on a least cost way to edit an arbitrary length character sequence α
into another arbitrary length character sequence β
, showing the probabilities as α
→
β
;arranging the correct word and incorrect word according to a single letter edit and assigning different weights for the single letter edit based on Levenshtein Distance; locating a least cost alignment using the single letter edit and edit weights; performing edit operations to accommodate error profiles comprising at least one of a user-by-user basis or a group-by-group basis; and generating a trained error model to identify misspelled words. - View Dependent Claims (12, 13, 14, 15)
-
Specification