Methods and systems for large-scale statistical misspelling correction
Abstract
Methods and systems for large-scale statistical misspelling correction are provided. A method implemented in a computer infrastructure includes reviewing an input text to detect spelling errors in one or more words and calculating a variable cost distance of letters associated with the one or more words. The method can also detect space-deletion errors and space-insertion errors. The method further includes determining a best candidate solution for correcting the spelling errors based on the variable cost distance, and correcting the spelling errors using the best candidate solution.
19 Claims
1. A method implemented in a computer infrastructure, comprising:
- reviewing an input text to detect spelling errors in one or more words by detecting space-deletion errors and space-insertion errors;
- calculating a variable cost distance of letters associated with the one or more words;
- determining a best candidate solution for correcting the spelling errors based on the variable cost distance; and
- correcting the spelling errors using the best candidate solution, wherein:
- the detecting space-deletion errors comprises choosing a sequence of letters with a maximum marginal probability via an A* lattice search and m-grams probability estimation; and
- the detecting space-insertion errors comprises generating all possible combinations between words in an input phrase.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
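As a rough illustration of the two space-error steps recited in claim 1, the Python sketch below shows one possible reading. `merge_candidates` enumerates every way of joining adjacent words in a phrase, which is one interpretation of "generating all possible combinations between words". `split_run` segments a run-together string by uniform-cost search (A* with a zero heuristic) over a lattice of vocabulary words. The function names, the unigram (m = 1) probability table, and the `max_len` bound are assumptions made for illustration; the claim does not specify them.

```python
import heapq
import math
from itertools import product


def merge_candidates(words):
    """Enumerate every way of joining adjacent words (space-insertion repair).

    Each of the len(words) - 1 gaps is either kept as a space or removed,
    so n words yield 2 ** (n - 1) candidate phrases.
    """
    if not words:
        return []
    candidates = []
    for gaps in product((" ", ""), repeat=len(words) - 1):
        parts = [words[0]]
        for sep, word in zip(gaps, words[1:]):
            parts.append(sep + word)
        candidates.append("".join(parts))
    return candidates


def split_run(text, unigram_prob, max_len=20):
    """Most probable segmentation of a run-together string (space-deletion repair).

    Nodes of the lattice are character offsets; an edge from i to j exists when
    text[i:j] is in the (hypothetical) unigram table, with cost -log p(word).
    Uniform-cost search is A* with a zero heuristic, which is valid here because
    every edge cost is non-negative. Returns a list of words, or None when the
    string cannot be covered by known words.
    """
    frontier = [(0.0, 0, [])]  # (accumulated cost, offset, words so far)
    settled = set()
    while frontier:
        cost, pos, words = heapq.heappop(frontier)
        if pos == len(text):
            return words
        if pos in settled:
            continue
        settled.add(pos)
        for end in range(pos + 1, min(len(text), pos + max_len) + 1):
            word = text[pos:end]
            p = unigram_prob.get(word)
            if p:
                heapq.heappush(
                    frontier, (cost - math.log(p), end, words + [word])
                )
    return None
```

A fuller system would score each merged or split candidate with a higher-order m-gram model and an informative A* heuristic; the zero heuristic is used here only to keep the sketch short.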
13. A computer program product comprising a tangible computer readable storage medium having readable program code embodied in the tangible computer readable storage medium, the computer program product includes at least one component operable to:
- review an input text to detect spelling errors in one or more words;
- calculate a variable Damerau-Levenshtein cost distance of letters associated with the one or more words;
- determine a best candidate solution for correcting the spelling errors based on the variable Damerau-Levenshtein cost distance using an A* lattice search and an m-grams probability estimation; and
- correct the spelling errors using the best candidate solution.
Dependent claims: 14, 15, 16, 17
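The "variable Damerau-Levenshtein cost distance" in claim 13 can be pictured with a small Python sketch. It implements the restricted (optimal string alignment) form of Damerau-Levenshtein distance, in which the substitution cost comes from a caller-supplied function instead of a fixed 1. The function name, the parameter names, and the default costs are assumptions for illustration and are not taken from the patent.

```python
def weighted_dl_distance(source, target, sub_cost=None,
                         indel_cost=1.0, swap_cost=1.0):
    """Restricted Damerau-Levenshtein distance with variable substitution cost.

    sub_cost(a, b) returns the cost of replacing letter a with letter b;
    when omitted, every substitution costs 1.0 (the classic distance).
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 1.0

    rows, cols = len(source) + 1, len(target) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel_cost          # delete all of source[:i]
    for j in range(1, cols):
        d[0][j] = j * indel_cost          # insert all of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            best = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + (0.0 if a == b else sub_cost(a, b)),
            )
            # Transposition of two adjacent letters, e.g. "teh" -> "the".
            if i > 1 and j > 1 and a == target[j - 2] and source[i - 2] == b:
                best = min(best, d[i - 2][j - 2] + swap_cost)
            d[i][j] = best
    return d[rows - 1][cols - 1]
```

For example, passing a `sub_cost` that returns a small value for letter pairs judged easy to confuse makes misspellings involving those pairs rank closer to the intended word than other single-letter errors.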
18. A computer system for tokenizing multilingual textual documents, the system comprising:
- a CPU, a computer readable memory and a tangible computer readable storage media;
- first program instructions to review an input text to detect spelling errors in one or more words;
- second program instruction to calculate a variable cost distance of letters associated with the one or more words;
- third program instructions to determine a best candidate solution for correcting the spelling errors based on the variable cost distance using an A* lattice search and an m-grams probability estimation; and
- fourth program instructions to correct the spelling errors using the best candidate solution, wherein:
- the first, second, third, and fourth program instructions are stored on the tangible computer readable storage media for execution by the CPU via the computer readable memory;
- the calculating the variable cost distance comprises utilizing a Damerau-Levenshtein distance to determine a cost of correcting the spelling errors, wherein letters having similar shapes, similar pronunciation, or keyboard proximity are given a low variable cost distance; and
- the determining the best candidate solution comprises comparing the spelling errors with all words in a dictionary using the variable cost distance and selecting one or more words having a minimum variable cost distance.
Dependent claims: 19
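The candidate-selection step of claim 18 (compare the misspelling with every dictionary word and keep the words at minimum variable cost distance) might look roughly like the following Python sketch. Only keyboard proximity is modeled, using a QWERTY layout and a cost of 0.5 for same-row neighboring keys; the layout, the cost values, and all function names are illustrative assumptions. Similar-shape and similar-pronunciation pairs would be added to the same low-cost set in the same way.

```python
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumed layout


def keyboard_neighbors(rows=QWERTY_ROWS):
    """Ordered pairs of keys that sit next to each other in the same row."""
    near = set()
    for row in rows:
        for left, right in zip(row, row[1:]):
            near.add((left, right))
            near.add((right, left))
    return near


NEAR = keyboard_neighbors()


def sub_cost(a, b, near_cost=0.5, far_cost=1.0):
    """Cheaper substitution for neighboring keys (hypothetical cost values)."""
    return near_cost if (a, b) in NEAR else far_cost


def variable_cost_distance(source, target):
    """Restricted Damerau-Levenshtein distance using sub_cost for substitutions."""
    d = [[0.0] * (len(target) + 1) for _ in range(len(source) + 1)]
    for i in range(1, len(source) + 1):
        d[i][0] = float(i)
    for j in range(1, len(target) + 1):
        d[0][j] = float(j)
    for i, a in enumerate(source, 1):
        for j, b in enumerate(target, 1):
            step = 0.0 if a == b else sub_cost(a, b)
            d[i][j] = min(d[i - 1][j] + 1.0,       # deletion
                          d[i][j - 1] + 1.0,       # insertion
                          d[i - 1][j - 1] + step)  # match or substitution
            if i > 1 and j > 1 and a == target[j - 2] and b == source[i - 2]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transposition
    return d[-1][-1]


def best_candidates(word, dictionary):
    """Return every dictionary word at the minimum variable cost distance."""
    scored = [(variable_cost_distance(word, entry), entry) for entry in dictionary]
    if not scored:
        return []
    lowest = min(cost for cost, _ in scored)
    return [entry for cost, entry in scored if cost == lowest]
```

Scanning the whole dictionary is linear in its size for every misspelled word; the claim recites comparison with all words, so the sketch does the same without indexing or pruning.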
Specification