Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction

US 7,047,493 B1
Filed: 03/31/2000
Issued: 05/16/2006
Est. Priority Date: 03/31/2000
Status: Expired due to Fees

Abstract

A spell checker based on the noisy channel model has a source model and an error model. The source model determines how likely a word w in a dictionary is to have been generated. The error model determines how likely the word w was to have been incorrectly entered as the string s (e.g., mistyped or incorrectly interpreted by a speech recognition system) according to the probabilities of string-to-string edits. The string-to-string edits allow conversion of one arbitrary length character sequence to another arbitrary length character sequence.

42 Claims

1. A computer-implemented method for determining a likelihood that a word in a dictionary is being incorrectly represented by a string, comprising:
- iteratively partitioning the word into multiple segments, each segment consisting of a character or character sequence, where each iteration partitions the word into a different number of the multiple segments;
  
  for each iteration of the partitioning, iteratively varying the lengths of the segments while maintaining the number of the segments;
  
  for each iteration of the partitioning, dividing the string into the same number of string segments as the number of word segments and iteratively varying the lengths of the string segments, wherein corresponding word segments and string segments can be of different lengths;
  
  for each iteration of varying the lengths of the word segments and the string segments, computing a probability for each pair, wherein each pair consists of a word segment and a corresponding string segment, and wherein the probability consists of a likelihood that the word segment is being incorrectly represented by the string segment;
  
  for each iteration of varying the lengths, computing a product of the probabilities of the pairs; and
  
  determining the likelihood that the word is being incorrectly represented by the string based on one of the products.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. A method as recited in claim 1, wherein a character sequence in the word has a first number of multiple characters and a character sequence in the string has a second number of multiple characters that is different from the first number of multiple characters.
  - 3. A method as recited in claim 1 and further comprising determining how likely the word is to have been generated.
  - 4. A method as recited in claim 1 and further comprising conditioning an edit operation that changes the string into the word on at least one of the probabilities.
  - 5. A method as recited in claim 1 and further comprising identifying the string as potentially incorrect.
  - 6. A method as recited in claim 1 and further comprising correcting the string to the word.
  - 7. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 1.
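
The enumeration recited in claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the names `partitions`, `segment_products`, `toy_pair_prob`, and `TOY_TABLE` are hypothetical, the probability values are invented, every segment is assumed to be non-empty, and identical segments are assumed to score 1.0.

```python
from itertools import combinations
from math import prod

def partitions(text, k):
    """Yield every way to cut `text` into exactly k non-empty contiguous segments."""
    for cuts in combinations(range(1, len(text)), k - 1):
        bounds = (0, *cuts, len(text))
        yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

def segment_products(word, string, pair_prob):
    """Yield (word_segments, string_segments, product) for each pairing of a
    word partition with a string partition that has the same segment count."""
    for k in range(1, min(len(word), len(string)) + 1):  # vary the number of segments
        for r in partitions(word, k):                    # vary word segment lengths
            for t in partitions(string, k):              # vary string segment lengths
                yield r, t, prod(pair_prob(ws, ss) for ws, ss in zip(r, t))

# Hypothetical segment-pair probabilities, for illustration only.
TOY_TABLE = {("ph", "f"): 0.2}

def toy_pair_prob(word_seg, string_seg):
    if word_seg == string_seg:
        return 1.0  # assumption: identical segments are certain
    return TOY_TABLE.get((word_seg, string_seg), 1e-6)  # small floor for unseen pairs

if __name__ == "__main__":
    best = max(segment_products("phone", "fone", toy_pair_prob), key=lambda x: x[2])
    print(best)
```

Exhaustive enumeration grows quickly with word length, so a practical system would likely bound segment lengths or use dynamic programming; the sketch favors clarity over speed.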

8. A computer-implemented method comprising:
- determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s based on partitioning the word w and the string s and computing probabilities for various partitionings, wherein a probability for a partitioning represents the probability that one or more edit operations convert first arbitrary-length character sequences α₁, α₂, α₃, . . . , α_n in the word w to corresponding second arbitrary-length character sequences β₁, β₂, β₃, . . . , β_n in the string s, wherein $P(s \mid w) = P(\alpha_1 \mid \beta_1) * P(\alpha_2 \mid \beta_2) * P(\alpha_3 \mid \beta_3) * \ldots * P(\alpha_n \mid \beta_n)$.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. A method as recited in claim 8, wherein lengths of corresponding first and second character sequences are different.
  - 10. A method as recited in claim 8 and further comprising determining how likely the word w is to have been generated.
  - 11. A method as recited in claim 8 and further comprising conditioning the edit operations on positions that the edits occur at within the word.
  - 12. A method as recited in claim 8 and further comprising correcting the string s to the word w.
  - 13. A method as recited in claim 8 and further comprising identifying the string s as potentially incorrect.
  - 14. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 8.
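
Claim 11 conditions the edit operations on where they occur within the word. One way this might look is sketched below. The three position labels, the function names, and the table values are assumptions made for illustration; the claims do not specify them.

```python
from math import prod

def position_label(start, end, word_len):
    """Classify a word segment by its location (assumed labels: start/middle/end)."""
    if start == 0:
        return "start"
    if end == word_len:
        return "end"
    return "middle"

def partition_probability(word_segments, string_segments, table, floor=1e-6):
    """Product of position-conditioned probabilities for one pair of partitions.

    `table` maps (word_segment, string_segment, position_label) to a probability;
    pairs missing from the table fall back to `floor`.
    """
    word_len = sum(len(seg) for seg in word_segments)
    offset = 0
    factors = []
    for alpha, beta in zip(word_segments, string_segments):
        pos = position_label(offset, offset + len(alpha), word_len)
        factors.append(table.get((alpha, beta, pos), floor))
        offset += len(alpha)
    return prod(factors)

# Hypothetical values: "ant" written as "ent" is assumed likelier at the end of a word.
toy_table = {
    ("depend", "depend", "start"): 0.95,
    ("ant", "ent", "end"): 0.3,
    ("ant", "ent", "middle"): 0.05,
}

if __name__ == "__main__":
    print(partition_probability(["depend", "ant"], ["depend", "ent"], toy_table))
```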

15. A computer-implemented method comprising:
- receiving an entered string s; and
  
  determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and computing probabilities for various partitionings, as follows:
  
  $P(s \mid w) = \sum_{R \in Part(w)} P(R \mid w) \sum_{\underset{\langle T \rangle = \langle R \rangle}{T \in Part(s)}} \prod_{i = 1}^{\langle R \rangle} P(T_{i} \mid R_{i})$ where Part(w) is a set of possible ways of partitioning the word w, Part(s) is a set of possible ways of partitioning the string s, R is a particular partition of the word w and T is a particular partition of the string s.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. A method as recited in claim 15 and further comprising selecting the partition that returns a highest probability.
  - 17. A method as recited in claim 15 and further comprising determining how likely the word w is to have been generated.
  - 18. A method as recited in claim 15 and further comprising correcting the string s to the word w.
  - 19. A method as recited in claim 15 and further comprising identifying the string s as potentially incorrect.
  - 20. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 15.
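
The double sum in claim 15 can be evaluated directly for short strings. The sketch below does so by brute force. It assumes non-empty segments and a uniform prior P(R|w) over all partitions of w; both are assumptions of this illustration, as are the function names and the toy probabilities.

```python
from itertools import combinations
from math import prod

def partitions(text, k):
    """Yield every way to cut `text` into exactly k non-empty contiguous segments."""
    for cuts in combinations(range(1, len(text)), k - 1):
        bounds = (0, *cuts, len(text))
        yield [text[a:b] for a, b in zip(bounds, bounds[1:])]

def all_partitions(text):
    """Yield every partition of `text`, for every possible segment count."""
    for k in range(1, len(text) + 1):
        yield from partitions(text, k)

def uniform_prior(r, word):
    """Assumed P(R|w): a word of n characters has 2**(n-1) partitions."""
    return 1.0 / 2 ** (len(word) - 1)

def summed_likelihood(word, string, seg_prob, partition_prior=uniform_prior):
    """P(s|w) as a sum over R in Part(w) and over T in Part(s) with as many segments as R."""
    total = 0.0
    for r in all_partitions(word):
        inner = sum(
            prod(seg_prob(t_i, r_i) for t_i, r_i in zip(t, r))
            for t in partitions(string, len(r))
        )
        total += partition_prior(r, word) * inner
    return total

# Hypothetical P(T_i | R_i) values keyed by (string_segment, word_segment).
toy = {("ab", "ab"): 0.8, ("a", "a"): 0.9, ("b", "b"): 0.9}

def toy_seg_prob(string_seg, word_seg):
    return toy.get((string_seg, word_seg), 0.0)

if __name__ == "__main__":
    print(summed_likelihood("ab", "ab", toy_seg_prob))
```

For the word "ab" the two partitions contribute 0.5 × 0.8 and 0.5 × 0.9 × 0.9, so the sum is 0.805.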

21. A computer-implemented method comprising:
- receiving an entered string s; and
  
  determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and computing probabilities for various partitionings, as follows:
  
  $P(s \mid w) = \max_{R \in Part(w), T \in Part(s)} P(R \mid w) * \prod_{i = 1}^{\langle R \rangle} P(T_{i} \mid R_{i})$ where Part(w) is a set of possible ways of partitioning the word w, Part(s) is a set of possible ways of partitioning the string s, R is a particular partition of the word w and T is a particular partition of the string s.
- Dependent claims (22, 23, 24, 25, 26, 27)
  - 22. A method as recited in claim 21 and further comprising omitting the term P(R|w) from the computation of P(s|w).
  - 23. A method as recited in claim 21 and further comprising setting terms P(T_i|R_i)=1 whenever T_i=R_i.
  - 24. A method as recited in claim 21 and further comprising determining how likely the word w is to have been generated.
  - 25. A method as recited in claim 21 and further comprising correcting the string s to the word w.
  - 26. A method as recited in claim 21 and further comprising identifying the string s as potentially incorrect.
  - 27. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 21.
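
As a worked illustration, suppose the simplifications of claims 22 and 23 are applied together (an assumption here; the claims recite them separately). In the notation of claim 21, with T and R having the same number of segments:

$P(s \mid w) \approx \max_{R \in Part(w), T \in Part(s)} \prod_{i = 1}^{\langle R \rangle} P(T_{i} \mid R_{i}) = \max_{R \in Part(w), T \in Part(s)} \prod_{i : T_{i} \neq R_{i}} P(T_{i} \mid R_{i})$

The first step drops the factor P(R|w), per claim 22. The second uses P(T_i|R_i) = 1 whenever T_i = R_i, per claim 23, so matching segments contribute a factor of 1 and only the segment pairs that differ affect the score.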

28. A computer-implemented method comprising:
- receiving an entered string s; and
  
  determining a probability P(s|w) expressing how likely a word w was to have been incorrectly entered as the string s, by partitioning the word w and the string s and finding a partition R of the word w and a partition T of the string s such that $\prod_{i = 1}^{\langle R \rangle} P(T_{i} \mid R_{i})$ is maximized.
- Dependent claims (29, 30, 31, 32)
  - 29. A method as recited in claim 28 and further comprising determining how likely the word w is to have been generated.
  - 30. A method as recited in claim 28 and further comprising correcting the string s to the word w.
  - 31. A method as recited in claim 28 and further comprising identifying the string s as potentially incorrect.
  - 32. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 28.
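
The maximization in claim 28 need not enumerate every pair of partitions. One common alternative, sketched here, is dynamic programming over prefixes of w and s. This is a swapped-in technique chosen for the illustration, not something the claim recites. The `max_len` bound on segment length, the non-empty-segment restriction, and the toy probabilities are likewise assumptions.

```python
from functools import lru_cache

def best_partition_score(word, string, seg_prob, max_len=3):
    """Maximum, over joint partitions of `word` and `string` into the same number
    of non-empty segments, of the product of seg_prob(string_seg, word_seg)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best score for explaining string[:j] as a rendering of word[:i].
        if i == 0 and j == 0:
            return 1.0
        if i == 0 or j == 0:
            return 0.0  # one side exhausted: no non-empty segment pair can be formed
        score = 0.0
        for a in range(1, min(max_len, i) + 1):      # length of the last word segment
            for b in range(1, min(max_len, j) + 1):  # length of the last string segment
                prev = best(i - a, j - b)
                if prev > 0.0:
                    p = seg_prob(string[j - b:j], word[i - a:i])
                    score = max(score, prev * p)
        return score

    return best(len(word), len(string))

# Hypothetical P(T_i | R_i) values keyed by (string_segment, word_segment).
toy = {("f", "ph"): 0.2}

def toy_seg_prob(string_seg, word_seg):
    if string_seg == word_seg:
        return 1.0  # identical segments score 1, as in claim 23
    return toy.get((string_seg, word_seg), 0.0)

if __name__ == "__main__":
    print(best_partition_score("phone", "fone", toy_seg_prob))
```

The memoized recursion visits each (i, j) prefix pair once, so the cost is roughly len(w) × len(s) × max_len², compared with the exponential growth of exhaustive enumeration.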

33. A computer-implemented method for training an error model used in a spell checker, comprising:
- determining, given a <wrong, right> training pair and multiple single character edits that convert characters in one of the right or wrong strings to characters in the other of the right or wrong strings at differing costs, an alignment of the wrong string and the right string that results in a least cost to convert the characters;
  
  collapsing any contiguous non-match edits into one or more common error regions, each error region containing one or more characters that can be converted to one or more other characters using a substitution edit; and
  
  computing a probability for each substitution edit.
- Dependent claims (34, 35, 36, 37)
  - 34. A method as recited in claim 33, wherein the assigning comprises assessing a cost of 0 to all match edits and a cost of 1 to all non-match edits.
  - 35. A method as recited in claim 33, wherein the single character edits comprise insertion, deletion, and substitution.
  - 36. A method as recited in claim 33, further comprising collecting multiple <wrong, right> training pairs from online resources.
  - 37. A method as recited in claim 33, further comprising expanding each of the error regions to capture at least one character on at least one side of the error region.
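
The training steps of claims 33 to 35 can be sketched as a standard edit-distance alignment followed by a pass that merges adjacent non-match edits. The sketch below makes several assumptions that the claims do not state: the function names are hypothetical, ties between equal-cost alignments are broken arbitrarily, and each probability is estimated as a relative frequency among the observed error regions that share the same right-hand sequence. The context expansion of claim 37 is omitted.

```python
from collections import Counter

def align(wrong, right):
    """Least-cost alignment as (right_char, wrong_char) pairs, with "" marking a gap.
    Matches cost 0; insertions, deletions, and substitutions cost 1."""
    n, m = len(right), len(wrong)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (right[i - 1] != wrong[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (right[i - 1] != wrong[j - 1]):
            pairs.append((right[i - 1], wrong[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((right[i - 1], ""))  # character missing from the wrong string
            i -= 1
        else:
            pairs.append(("", wrong[j - 1]))  # extra character in the wrong string
            j -= 1
    return pairs[::-1]

def error_regions(pairs):
    """Collapse each run of contiguous non-match edits into one (right_seq, wrong_seq)."""
    regions, cur_right, cur_wrong = [], "", ""
    for r, w in pairs:
        if r == w:
            if cur_right or cur_wrong:
                regions.append((cur_right, cur_wrong))
            cur_right, cur_wrong = "", ""
        else:
            cur_right += r
            cur_wrong += w
    if cur_right or cur_wrong:
        regions.append((cur_right, cur_wrong))
    return regions

def substitution_probabilities(training_pairs):
    """Relative frequency of each (right_seq -> wrong_seq) substitution edit."""
    counts = Counter()
    for wrong, right in training_pairs:
        counts.update(error_regions(align(wrong, right)))
    totals = Counter()
    for (r, _w), c in counts.items():
        totals[r] += c
    return {(r, w): c / totals[r] for (r, w), c in counts.items()}

if __name__ == "__main__":
    pairs = [("fone", "phone"), ("fysics", "physics"), ("vone", "phone")]
    print(substitution_probabilities(pairs))
```

For the pair <akgsual, actual>, the three single-character non-match edits between the leading "a" and the trailing "ual" collapse into the single region "ct" to "kgs".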

38. A program embodied on a computer readable medium, which when executed, directs a computer to perform the following:
- (a) receive an entered string s;
  
  (b) for multiple words w in a dictionary, determine:
  
  how likely a word w in a dictionary is to have been generated, P(w|context); and
  
  how likely the word w was to have been entered as the string s, P(s|w), based on partitioning the word w and the string s and computing probabilities for various partitionings to determine a highest likelihood of at least one edit operation that converts one of multiple character sequences of arbitrary length in the word to one of multiple character sequences of arbitrary length in the string; and
  
  (c) maximize P(s|w)*P(w|context) to identify which of the words is most likely the word intended when the string s was entered.
- Dependent claims (39, 40, 41, 42)
  - 39. A program as recited in claim 38, wherein the determination (b) is performed for all words in the dictionary.
  - 40. A program as recited in claim 38, further comprising computer-executable instructions that direct a computer to perform one of leaving the string unchanged, autocorrecting the string into the word, or offering a list of possible corrections.
  - 41. A spell checker program, embodied on a computer-readable medium, comprising the program of claim 38.
  - 42. A language conversion program, embodied on a computer-readable medium, comprising the program of claim 38.
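
Step (c) of claim 38 combines the two models by maximizing P(s|w)*P(w|context) over candidate words. A minimal sketch follows. The function names, the toy dictionary, and all probability values are hypothetical, and the source model here ignores context for simplicity.

```python
def rank_candidates(string, dictionary, error_model, source_model):
    """Return (score, word) pairs sorted from most to least likely intended word,
    where score = P(s|w) * P(w|context)."""
    scored = [(error_model(string, w) * source_model(w), w) for w in dictionary]
    return sorted(scored, reverse=True)

def most_likely_word(string, dictionary, error_model, source_model):
    """Return the dictionary word that maximizes P(s|w) * P(w|context)."""
    return max(dictionary, key=lambda w: error_model(string, w) * source_model(w))

# Hypothetical models, for illustration only.
toy_source = {"phone": 0.6, "fine": 0.4}                     # stands in for P(w|context)
toy_error = {("fone", "phone"): 0.2, ("fone", "fine"): 0.05}  # stands in for P(s|w)

def source_model(word):
    return toy_source.get(word, 0.0)

def error_model(string, word):
    return toy_error.get((string, word), 0.0)

if __name__ == "__main__":
    print(most_likely_word("fone", toy_source, error_model, source_model))
    print(rank_candidates("fone", toy_source, error_model, source_model))
```

The ranked list corresponds to the "list of possible corrections" option in claim 40; choosing among leaving the string unchanged, autocorrecting, and offering the list would need thresholds that the claims leave unspecified.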

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Brill, Eric D.; Moore, Robert C.
Primary Examiner(s): STORK, KYLE R
Assistant Examiner(s): Schlaifer, Jonathan
Application Number: US09/539,357
Time in Patent Office: 2,237 Days
Field of Search: 715/533, 715/532, 715/513, 707/795
US Class Current: 715/257
CPC Class Codes:
G06F 40/232 Orthographic correction, e....
G10L 15/183 using context dependencies,...
