Spell checker with arbitrary length string-to-string transformations to improve noisy channel spelling correction

US 7,366,983 B2
Filed: 07/15/2005
Issued: 04/29/2008
Est. Priority Date: 03/31/2000
Status: Expired due to Fees

Abstract

A spell checker based on the noisy channel model has a source model and an error model. The source model determines how likely a word w in a dictionary is to have been generated. The error model determines how likely the word w was to have been incorrectly entered as the string s (e.g., mistyped or incorrectly interpreted by a speech recognition system) according to the probabilities of string-to-string edits. The string-to-string edits allow conversion of one arbitrary length character sequence to another arbitrary length character sequence.
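
The way the two models combine can be written in the standard noisy channel form. The symbols D (the dictionary) and ŵ (the selected correction) are notation assumed here for illustration and do not appear in the patent text:

```latex
\hat{w} \;=\; \arg\max_{w \in D} P(w \mid s) \;=\; \arg\max_{w \in D} P(s \mid w)\, P(w)
```

Here P(w) is the source model and P(s | w) is the error model. The denominator P(s) from Bayes' rule is omitted because it is the same for every candidate word w.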

15 Claims

1. A method comprising:
- receiving an entered string;
  
  determining how likely a word w may be incorrectly entered as a string s based on partitioning the word w and the string s;
  
  computing probabilities for various partitionings to determine a highest likelihood of at least one edit operation that converts a first character sequence of arbitrary length in the word w to a second character sequence of arbitrary length in the string s;
  
  implementing edit operations consisting of insertion, deletion, substitution, matching, transposition, doubling, and halving;
  
  implementing an edit to be conditioned on a probability of a position that the edit occurs, P(α → β | PSN), wherein edit operations are characterized as α → β, where α is one character sequence of zero or more characters, β is another character sequence of zero or more characters, PSN describes positional information about a substring within the word, including the position may be a start of a word, an end of a word, or some other location within the word (PSN = {start of word, end of word, other});
  
  wherein edits operations are not constrained or limited to a specified set of changes;
  
  adding a start-of-word symbol and an end-of-word symbol to each word to provide the positional information; and
  
  identifying misspelled words, wherein the misspelled words may be potentially corrected to an appropriate spelling.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method as recited in claim 1, further comprising determining how likely the word w may be generated.
  - 3. The method as recited in claim 1, further comprising conditioning an edit operation that changes the string into the word on at least one of the probabilities.
  - 4. The method as recited in claim 1, further comprising identifying the string as potentially incorrect.
  - 5. The method as recited in claim 1, further comprising correcting the string to the word.
  - 6. A computer readable medium having computer-executable instructions that, when executed on a processor, perform the method as recited in claim 1.
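
The partitioning step in claim 1 can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the patented implementation: the names `position`, `best_partition_prob`, `toy_edit_prob`, and `TOY_EDITS`, the `max_len` cap on segment length, and all probability values are hypothetical. Position is derived from segment indices rather than from explicit start-of-word and end-of-word symbols, which is a simplification.

```python
from functools import lru_cache


def position(i, j, n):
    """Classify the segment w[i:j] of a word of length n as start, end, or other."""
    if i == 0:
        return "start"
    if j == n:
        return "end"
    return "other"


def best_partition_prob(w, s, edit_prob, max_len=3):
    """Highest product of P(alpha -> beta | PSN) over joint partitions of w and s.

    edit_prob(alpha, beta, psn) is a caller-supplied (hypothetical) probability table.
    max_len caps segment length to keep the search small; it is an assumption here.
    """
    n, m = len(w), len(s)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Both strings fully consumed: the empty remainder has probability 1.
        if i == n and j == m:
            return 1.0
        top = 0.0
        for a in range(max_len + 1):
            for b in range(max_len + 1):
                if a == 0 and b == 0:
                    continue  # an edit must consume at least one character
                if i + a > n or j + b > m:
                    continue
                alpha, beta = w[i:i + a], s[j:j + b]
                p = edit_prob(alpha, beta, position(i, i + a, n))
                if p > 0.0:
                    top = max(top, p * best(i + a, j + b))
        return top

    return best(0, 0)


# Hypothetical toy table: one multi-character edit that is only allowed at the end of a word.
TOY_EDITS = {("ant", "ent", "end"): 0.1}


def toy_edit_prob(alpha, beta, psn):
    """Toy model: single-character matches are likely; everything else comes from TOY_EDITS."""
    if (alpha, beta, psn) in TOY_EDITS:
        return TOY_EDITS[(alpha, beta, psn)]
    if len(alpha) == 1 and alpha == beta:
        return 0.9
    return 0.0


score = best_partition_prob("pleasant", "pleasent", toy_edit_prob)
```

With this toy table the only partition with nonzero probability matches the five letters of "pleas" one at a time and then applies "ant" → "ent" at the end of the word.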

7. A program embodied on a computer readable storage medium, which when executed, directs a computing device to perform spell checking, comprising:
- a source model component of the program to determine how likely a word w in a dictionary may be generated; and
  
  an error model component of the program to determine how likely the word w may be incorrectly entered as a string s based on arbitrary length string-to-string transformations;
  
  wherein the error model component partitions the word w and the string s and computes probabilities for various partitionings based on string-to-string edits;
  
  wherein the error model component performs edit operations consisting of insertion, deletion, substitution, matching, transposition, doubling, and halving;
  
  wherein the error model component adds a start-of-word symbol and an end-of-word symbol to each word to provide a positional information; and
  
  wherein the program identifies misspelled words, and potentially corrects misspelled words.
- Dependent claims (8, 9, 10):
  - 8. A program embodied on a computer readable storage medium as recited in claim 7, wherein the string-to-string transformations comprise converting a first character sequence of a first length into a second character sequence of a second length that is different than the first length.
  - 9. A program embodied on a computer readable storage medium as recited in claim 7, wherein the string-to-string transformations comprise converting a first character sequence with multiple characters into a second character sequence with multiple characters.
  - 10. A program embodied on a computer readable storage medium as recited in claim 7, wherein the string-to-string transformations comprise converting a first character sequence having a first number of multiple characters into a second character sequence having a second number of multiple characters that is different from the first number of multiple characters.
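
The interaction between the source model component and the error model component in claim 7 can be sketched as a candidate ranker. This is an illustrative assumption about how the two scores might be combined, not code from the patent: `rank_corrections`, `SOURCE_PROBS`, `ERROR_TABLE`, `toy_error_prob`, and every probability value are hypothetical.

```python
import math


def rank_corrections(s, source_probs, error_prob, top_k=3):
    """Rank dictionary words w for an entered string s by log P(w) + log P(s | w)."""
    scored = []
    for w, p_w in source_probs.items():
        p_s_given_w = error_prob(w, s)
        # Skip candidates that either model considers impossible.
        if p_w > 0.0 and p_s_given_w > 0.0:
            scored.append((math.log(p_w) + math.log(p_s_given_w), w))
    scored.sort(reverse=True)
    return [w for _, w in scored[:top_k]]


# Hypothetical source model: unigram probabilities for a tiny dictionary.
SOURCE_PROBS = {"the": 0.05, "then": 0.01, "tea": 0.001}

# Hypothetical error model: P(s | w) looked up from a fixed table.
ERROR_TABLE = {("the", "teh"): 0.02, ("tea", "teh"): 0.03, ("then", "teh"): 0.001}


def toy_error_prob(w, s):
    return ERROR_TABLE.get((w, s), 0.0)


suggestions = rank_corrections("teh", SOURCE_PROBS, toy_error_prob)
```

In this toy data "tea" has the highest error model probability for "teh", but "the" ranks first because its source model probability is much higher, which shows why both components are needed.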

11. A method for training an error model, implemented at least in part by a computing device, the method comprising:
- providing a training set that includes correct dictionary words along with associated error words observed when entering words;
  
  producing probabilities of different arbitrary-length string-to-string corrections over a large set of training words and the associated errors when entering words;
  
  deriving probabilities on how likely a correct word may be changed to an incorrect word based on the training set;
  
  wherein the probabilities are based on a least cost way to edit an arbitrary length character sequence α into another arbitrary length character sequence β, showing the probabilities as α → β;
  
  arranging the correct word and incorrect word according to a single letter edit and assigning different weights for the single letter edit based on Levenshtein Distance;
  
  locating a least cost alignment using the single letter edit and edit weights;
  
  performing edit operations to accommodate error profiles comprising at least one of a user-by-user basis or a group-by-group basis; and
  
  generating a trained error model to identify misspelled words.
- Dependent claims (12, 13, 14, 15):
  - 12. The method as recited in claim 11, wherein the single letter edit comprises at least one of an insertion, a substitution, a deletion, or a match.
  - 13. The method as recited in claim 12, wherein the different weights for the single letter edit comprises a weight of 1 each for the insertion, the substitution, the deletion, and a weight of 0 for the match.
  - 14. The method as recited in claim 11, further comprising running the error model to auto-correct string s into word w and saving <s,w> tuples to retrain the error model.
  - 15. The method as recited in claim 14, further comprising retraining the error model over a collection of on-line resources.
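
The single-letter alignment step in claims 11 to 13 corresponds to a Levenshtein alignment with a weight of 1 for insertion, substitution, and deletion and 0 for a match. The sketch below covers only that alignment step; the function name `align` and the tie-breaking order in the backtrace (diagonal move first, then deletion, then insertion) are assumptions, and the later steps of the claimed training method are not shown.

```python
def align(w, s):
    """Return (cost, edits) for a least-cost single-letter alignment of w into s.

    Each edit is a pair (alpha, beta): a match or substitution has one character
    on each side, a deletion has beta == "", and an insertion has alpha == "".
    """
    n, m = len(w), len(s)
    # cost[i][j] is the least cost of editing w[:i] into s[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every character of w[:i]
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every character of s[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if w[i - 1] == s[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # match (0) or substitution (1)
                cost[i - 1][j] + 1,        # deletion
                cost[i][j - 1] + 1,        # insertion
            )

    # Walk back from the bottom-right cell to recover one least-cost alignment.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and w[i - 1] == s[j - 1] else 1
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            edits.append((w[i - 1], s[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            edits.append((w[i - 1], ""))
            i -= 1
        else:
            edits.append(("", s[j - 1]))
            j -= 1
    edits.reverse()
    return cost[n][m], edits


distance, single_letter_edits = align("cat", "cut")
```

Counting the resulting (alpha, beta) pairs over a training set of word and error pairs is one plausible way to start estimating edit probabilities from such alignments.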

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Moore, Robert; Brill, Eric
Primary Examiner(s): Wiley; David A.
Assistant Examiner(s): Parker; Brandon
Application Number: US11/182,214
Publication Number: US 20050257147A1
Time in Patent Office: 1,019 Days
Field of Search: 715/533, 715/513, 715/532, 715/534, 704/10, 704/9
US Class Current: 715/257
CPC Class Codes:
G06F 40/232 Orthographic correction, e....
G10L 15/183 using context dependencies,...
