Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction
First Claim
Patent Images
1. A computer-implemented method for determining a likelihood that a word in a dictionary is being incorrectly represented by a string;
- comprising;
iteratively partitioning the word into multiple segments, each segment consisting of a character or character sequence, where each iteration partitions the word in to a different number of the multiple segments;
for each iteration of the partitioning, iteratively varying the lengths of the segments while maintaining the number of the segments;
for each iteration of the partitioning, dividing the string into the same number of string segments as the number of word segments and iteratively varying the lengths of the string segments, wherein corresponding word segments and string segments can be of different lengths;
for each iteration of varying the lengths of the word segments and the string segments, computing a probability for each pair, wherein each pair consists of a word segment and a corresponding string segment, and wherein the probability consists of a likelihood that the word segment is being incorrectly represented by the string segment;
for each iteration of varying the lengths, computer a product of the probabilities of the pairs; and
determining the likelihood that the word is being incorrectly represented by the string based on one of the products.
2 Assignments
0 Petitions
Accused Products
Abstract
A spell checker based on the noisy channel model has a source model and an error model. The source model determines how likely a word w in a dictionary is to have been generated. The error model determines how likely the word w was to have been incorrectly entered as the string s (e.g., mistyped or incorrectly interpreted by a speech recognition system) according to the probabilities of string-to-string edits. The string-to-string edits allow conversion of one arbitrary length character sequence to another arbitrary length character sequence.
-
Citations
42 Claims
-
1. A computer-implemented method for determining a likelihood that a word in a dictionary is being incorrectly represented by a string;
- comprising;
iteratively partitioning the word into multiple segments, each segment consisting of a character or character sequence, where each iteration partitions the word in to a different number of the multiple segments;
for each iteration of the partitioning, iteratively varying the lengths of the segments while maintaining the number of the segments;
for each iteration of the partitioning, dividing the string into the same number of string segments as the number of word segments and iteratively varying the lengths of the string segments, wherein corresponding word segments and string segments can be of different lengths;
for each iteration of varying the lengths of the word segments and the string segments, computing a probability for each pair, wherein each pair consists of a word segment and a corresponding string segment, and wherein the probability consists of a likelihood that the word segment is being incorrectly represented by the string segment;
for each iteration of varying the lengths, computer a product of the probabilities of the pairs; and
determining the likelihood that the word is being incorrectly represented by the string based on one of the products. - View Dependent Claims (2, 3, 4, 5, 6, 7)
- comprising;
-
8. A computer-implemented method comprising:
determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s based on portioning the word w and the string s and computing probabilities for various partitioning, wherein a probability for a partitioning represents the probability that one or more edit operations convert first arbitrary-length character sequences α
1, α
2, α
3, . . . , α
n in the word w to corresponding second arbitrary-length character sequences β
1,β
2,β
3, . . . β
n in the string s, wherein P(s|w)=P(α
1|β
1)* P(α
2|β
2)*P(α
|β
3)* . . . *P(α
n|β
n).- View Dependent Claims (9, 10, 11, 12, 13, 14)
-
15. A computer-implemented method comprising:
- receiving an entered string s; and
determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and computing probabilities for various partitionings, as follows;where Part(w) is a set of possible ways of partitioning the word w, Part(s) is a set of possible ways of partitioning the string s, R is a particular partition of the word w and T is a particular partition of the string s. - View Dependent Claims (16, 17, 18, 19, 20)
- receiving an entered string s; and
-
21. A computer-implemented method comprising:
- receiving an entered string s; and
determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and computing probabilities for various partitionings, as follows;where Part(w) is a set of possible ways of partitioning the word w, Part(s) is a set of possible ways of partitioning the string s, R is a particular partition of the word w and T is a particular partition of the string s. - View Dependent Claims (22, 23, 24, 25, 26, 27)
- receiving an entered string s; and
-
28. A computer-implemented method comprising:
- receiving an entered string s; and
determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and finding a partition R of the word w and a partition T of the string s such thatis maximized. - View Dependent Claims (29, 30, 31, 32)
- receiving an entered string s; and
-
33. A computer-implemented method for training an error model used in a spell checker, comprising:
- determining, given a <
wrong, right>
training pair and multiple single character edits that convert characters in one of the right or wrong strings to characters in the other of the right or wrong strings at differing costs, an alignment of the wrong string and the right string that results is a least cost to convert the characters;
collapsing any contiguous non-match edits into one or more common error regions, each error region containing one or more characters that can be converted to one or more other characters using a substitution edit; and
computing a probability for each substitution edit. - View Dependent Claims (34, 35, 36, 37)
- determining, given a <
-
38. A program embodied on a computer readable medium, which when executed, directs a computer to perform the following:
-
(a) receive an entered string s; (b) for multiple words w in a dictionary, determine; how likely a word w in a dictionary is to have been generated, P(w|context); and how likely the word w was to have been entered as the string s, P(s|w), based on partitioning the word w and the string s and computing probabilities for various partitionings to determine a highest likelihood of at least one edit operation that converts one of multiple character sequence of arbitrary length in the word to one of multiple character sequences of arbitrary length in the string; and (c) maximize P(s|w)*P(w|context) to identify which of the words is most likely the word intended when the string s was entered. - View Dependent Claims (39, 40, 41, 42)
-
Specification