Linguistic error detection
First Claim
1. A method for detecting a linguistic error within a sequence of words of a sentence, the method comprising:
receiving, from an application executing on a computing device, a plurality of tokens each forming a distinct feature of a first sentence within an electronic document;
processing the plurality of tokens, at the computing device, to form at least one partition having at least three sequential tokens of the plurality of tokens;
calculating, at the computing device, a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence;
calculating, at the computing device, a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence;
calculating, at the computing device, a third numerical value based on a comparison of the first numerical value to the second numerical value;
comparing the third numerical value to a first threshold value; and
evaluating the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from the first threshold value.
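The steps of claim 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `partitions` and `flag_partitions` are hypothetical, the two probability functions are supplied by the caller, the "comparison" is assumed to be a ratio, and "deviating from" the threshold is assumed to mean falling below it.

```python
def partitions(tokens, size=3):
    """Slide a window of `size` sequential tokens over the sentence."""
    if size < 3:
        raise ValueError("the claim calls for at least three sequential tokens")
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def flag_partitions(tokens, p_sequence, p_random, threshold, size=3):
    """Return the windows whose ratio falls below `threshold`.

    p_sequence(window) -> first numerical value (in-sequence probability)
    p_random(window)   -> second numerical value (random-occurrence probability)
    """
    flagged = []
    for window in partitions(tokens, size):
        first = p_sequence(window)
        second = p_random(window)
        # Assumption: the third value is the ratio first/second; a zero
        # denominator is treated as the lowest possible score.
        third = first / second if second > 0 else 0.0
        if third < threshold:
            flagged.append(window)
    return flagged


# Toy probabilities, invented purely for illustration.
demo = flag_partitions(
    ["a", "b", "c", "d"],
    p_sequence=lambda w: 0.001 if w == ("b", "c", "d") else 0.1,
    p_random=lambda w: 0.01,
    threshold=1.0,
)
```

With these toy values, only the window whose in-sequence probability is lower than its random-occurrence probability is flagged.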
Abstract
Potential linguistic errors within a sequence of words of a sentence are identified based on analysis of a configurable sliding window. The analysis rests on the assumption that if a sequence of words occurs frequently enough within a large, well-formed corpus, its joint probability of occurring in a sentence is very likely to be greater than that of the same words in random order.
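The assumption stated in the abstract can be illustrated on a toy corpus. The sketch below is only an illustration: the corpus is invented, and the helper names `joint` and `independent` are hypothetical. It estimates a window's joint probability from raw tri-gram counts and compares it with the product of the individual word frequencies.

```python
from collections import Counter
from math import prod

# Invented toy corpus; a real system would use a large, well-formed corpus.
corpus = (
    "the cat sat on the mat . "
    "the cat sat on the rug . "
    "the dog sat on the mat ."
).split()

unigrams = Counter(corpus)
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))


def joint(window):
    """Relative frequency of the exact three-word sequence."""
    return trigrams[tuple(window)] / sum(trigrams.values())


def independent(window):
    """Probability of the same words if each were drawn independently."""
    return prod(unigrams[w] / len(corpus) for w in window)


well_formed = ("the", "cat", "sat")
scrambled = ("cat", "the", "sat")
print(joint(well_formed) > independent(well_formed))   # ordered sequence
print(joint(scrambled) > independent(scrambled))       # same words, reordered
```

The independent estimate is the same for both orderings, so only the joint estimate distinguishes the well-formed sequence from the scrambled one.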
20 Claims
1. A method for detecting a linguistic error within a sequence of words of a sentence, the method comprising:
receiving, from an application executing on a computing device, a plurality of tokens each forming a distinct feature of a first sentence within an electronic document;
processing the plurality of tokens, at the computing device, to form at least one partition having at least three sequential tokens of the plurality of tokens;
calculating, at the computing device, a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence;
calculating, at the computing device, a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence;
calculating, at the computing device, a third numerical value based on a comparison of the first numerical value to the second numerical value;
comparing the third numerical value to a first threshold value; and
evaluating the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from the first threshold value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
12. A computing device, comprising:
a processing unit; and
a system memory connected to the processing unit, the system memory including instructions that, when executed by the processing unit, cause the processing unit to implement an error detection module configured to detect a linguistic error within a sequence of words of a sentence, wherein the error detection module comprises:
a classification module configured to:
receive, from an application executing on the computing device, a plurality of tokens each forming a feature of a first sentence within an electronic document of the application;
process the plurality of tokens to form at least one partition having at least three sequential tokens of the plurality of tokens;
calculate a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence;
calculate a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence;
calculate a third numerical value based on a comparison of the first numerical value to the second numerical value;
compare the third numerical value to a first threshold value; and
evaluate the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from the first threshold value; and
a language model module configured as a data repository to store a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text, wherein numerical values used in calculating the first and second numerical values are retrieved from the language model module by the classification module.
Dependent claims: 13, 14, 15, 16, 17, 18, 19.
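The division of labor in claim 12, with a language model module acting as a data repository and a classification module retrieving values from it, might look like the following sketch. The class and method names are hypothetical, the probability table is invented, the `unseen` floor for missing entries is an assumption not found in the claim, and the comparison is again assumed to be a ratio that flags a window when it falls below the threshold.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageModelModule:
    """Data repository of conditional probabilities keyed by (context, word)."""
    table: dict = field(default_factory=dict)
    unseen: float = 1e-6  # assumed floor for entries missing from the table

    def probability(self, word, context=()):
        return self.table.get((tuple(context), word), self.unseen)


class ClassificationModule:
    """Retrieves values from the language model and evaluates one window."""

    def __init__(self, language_model, threshold):
        self.lm = language_model
        self.threshold = threshold

    def evaluate(self, a, b, c):
        lm = self.lm
        # First value: probability of the three tokens occurring in sequence.
        first = lm.probability(a) * lm.probability(b, (a,)) * lm.probability(c, (a, b))
        # Second value: probability of the same tokens occurring independently.
        second = lm.probability(a) * lm.probability(b) * lm.probability(c)
        third = first / second  # assumed form of the comparison
        return third < self.threshold  # True means "potential error"


# Invented toy probabilities for illustration only.
lm = LanguageModelModule(table={
    ((), "the"): 0.05, ((), "cat"): 0.01, ((), "sat"): 0.01,
    (("the",), "cat"): 0.1, (("the", "cat"), "sat"): 0.2,
})
classifier = ClassificationModule(lm, threshold=1.0)
print(classifier.evaluate("the", "cat", "sat"))
print(classifier.evaluate("cat", "the", "sat"))
```

Keeping the probability table behind a single `probability` lookup means the classification logic never needs to know how the model was derived from the corpus.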
20. A computer readable storage device having computer-executable instructions that, when executed by a computing device, cause the computing device to perform steps comprising:
receiving a command configured to instantiate linguistic error detection within at least three sequential tokens of a plurality of tokens each forming a word of a first sentence within an electronic document;
parsing the plurality of tokens to form at least one partition having the at least three sequential tokens;
calculating a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence, wherein the first numerical value is calculated using a first equation having a form:
P(a,b,c,d,e,f, ...) = P(a)P(b|a)P(c|a,b)P(d|a,b,c)P(e|a,b,c,d)P(f|a,b,c,d,e) = P(a)P(b|a)P(c|a,b)P(d|b,c)P(e|c,d)P(f|d,e),
wherein terms having the form P(a) within the first equation correspond to a uni-gram probability, terms having the form P(b|a) within the first equation correspond to a bi-gram probability, and terms having the form P(c|a,b) within the first equation correspond to a tri-gram probability;
calculating a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence, wherein the second numerical value is calculated using a second equation having a form:
Q(a,b,c,d,e,f, ...) = P(a)P(b)P(c)P(d)P(e)P(f),
wherein terms having the form P(a) within the second equation correspond to a uni-gram probability, and wherein numerical values used in calculating the first and second numerical values are retrieved from a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text;
calculating a third numerical value based on a comparison of the first numerical value to the second numerical value;
evaluating the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from a first threshold value and the second numerical value deviating from a second threshold value using a classification function having a form:
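The first and second equations of claim 20 can be sketched for a window of any length as shown below. The sketch covers only P and Q; the form of the classification function is not reproduced in this text, so it is not implemented here. The function names and the `cond(word, context)` callable, which returns P(word | context) and treats an empty context as a uni-gram lookup, are hypothetical. Logarithms are used as an implementation choice to avoid underflow on long windows.

```python
import math


def log_p(tokens, cond):
    """log P(a,b,c,...) with each term conditioned on at most two prior tokens.

    For tokens a, b, c, d this sums the logs of
    P(a), P(b|a), P(c|a,b), P(d|b,c), matching the tri-gram form of the
    first equation.
    """
    total = 0.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - 2):i])
        total += math.log(cond(word, context))
    return total


def log_q(tokens, cond):
    """log Q(a,b,c,...) = sum of uni-gram log probabilities."""
    return sum(math.log(cond(word, ())) for word in tokens)


# Record which contexts are requested, using a constant toy probability.
seen = []

def toy_cond(word, context):
    seen.append((word, context))
    return 0.5

window = ["a", "b", "c", "d"]
lp = log_p(window, toy_cond)
lq = log_q(window, toy_cond)
```

Because `cond` must return a strictly positive value for `math.log` to succeed, a real model would need some smoothing or floor for unseen N-grams; that choice is left open here.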
Specification