Linguistic error detection

US 8,855,997 B2
Filed: 07/28/2011
Issued: 10/07/2014
Est. Priority Date: 07/28/2011
Status: Active Grant

Abstract

Potential linguistic errors within a sequence of words of a sentence are identified based on analysis of a configurable sliding window. The analysis rests on the assumption that if a sequence of words occurs frequently enough within a large, well-formed corpus, its joint probability of occurring in a sentence is very likely to be greater than that of the same words in random order.
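
The configurable sliding window described in the abstract can be illustrated with a minimal sketch. The function name `sliding_windows`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only; the source states only that each partition has at least three sequential tokens and that the window size is configurable.

```python
from typing import List, Sequence, Tuple


def sliding_windows(tokens: Sequence[str], size: int = 3) -> List[Tuple[str, ...]]:
    """Return every run of `size` consecutive tokens, in sentence order.

    `size` plays the role of the configurable window length (hypothetical
    parameter name); the claims require at least three sequential tokens.
    """
    if size < 3:
        raise ValueError("a partition needs at least three sequential tokens")
    # A sentence shorter than the window yields no partitions.
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Assumed tokenization: one token per whitespace-separated word.
sentence = "he go to school yesterday".split()
windows = sliding_windows(sentence, size=3)
for window in windows:
    print(window)
```

Each window would then be scored on its own, so a longer window trades locality for more context per score.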

20 Claims

1. A method for detecting a linguistic error within a sequence of words of a sentence, the method comprising:
- receiving, from an application executing on a computing device, a plurality of tokens each forming a distinct feature of a first sentence within an electronic document;
- processing the plurality of tokens, at the computing device, to form at least one partition having at least three sequential tokens of the plurality of tokens;
- calculating, at the computing device, a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence;
- calculating, at the computing device, a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence;
- calculating, at the computing device, a third numerical value based on a comparison of the first numerical value to the second numerical value;
- comparing the third numerical value to a first threshold value; and
- evaluating the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from the first threshold value.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein the plurality of tokens each corresponds to a word of the first sentence.
  - 3. The method of claim 2, further comprising calculating the first numerical value using an equation having a form:
    - P(a,b,c,d,e,f, ...) = P(a)P(b|a)P(c|a,b)P(d|a,b,c)P(e|a,b,c,d)P(f|a,b,c,d,e) = P(a)P(b|a)P(c|a,b)P(d|b,c)P(e|c,d)P(f|d,e), wherein terms having the form P(a) correspond to a uni-gram probability, terms having the form P(b|a) correspond to a bi-gram probability, and terms having the form P(c|a,b) correspond to a tri-gram probability.
  - 4. The method of claim 2, further comprising calculating the second numerical value using an equation having a form:
    - Q(a,b,c,d,e,f, ...) = P(a)P(b)P(c)P(d)P(e)P(f), wherein terms having the form P(a) correspond to a uni-gram probability.
  - 5. The method of claim 2, further comprising retrieving numerical values used in calculating the first and second numerical values from a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text.
  - 6. The method of claim 5, further comprising evaluating the at least three sequential tokens as having a linguistic error upon the third numerical value deviating from the first threshold value and the second numerical value deviating from a second threshold value using a classification function having a form:
  - 7. The method of claim 1, further comprising emphasizing the at least three sequential tokens within the electronic document upon determining that the at least three sequential tokens include a potential linguistic error.
  - 8. The method of claim 7, further comprising emphasizing the at least three sequential tokens by at least one user perceivable cue selected from a group including:
    - a visual cue;
    - an audio cue; and
    - a tactile cue.
  - 9. The method of claim 1, further comprising receiving user specification defining a number of sequential tokens of the at least one partition.
  - 10. The method of claim 1, further comprising receiving a command prior to receiving the plurality of tokens, wherein the command is configured to instantiate linguistic error detection within the at least three sequential tokens.
  - 11. The method of claim 1, wherein the plurality of tokens each corresponds to a word of the first sentence, and wherein numerical values used in calculating the first and second numerical values are retrieved from a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text.
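
The three numerical values recited in claims 1, 3 and 4 can be sketched as follows. Everything beyond the two equation forms is an assumption made for illustration: the toy probability tables, the `FLOOR` fallback for unseen events, the use of log space, the choice of a log ratio as the "comparison", reading "deviating from" as "falling below", and the default threshold of 0.0. None of these details are stated in the claims.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical toy values; a real model would be derived from a corpus.
UNIGRAM: Dict[str, float] = {"I": 0.05, "am": 0.02, "happy": 0.001}
BIGRAM: Dict[Tuple[str, str], float] = {("I", "am"): 0.3}  # P(b|a)
TRIGRAM: Dict[Tuple[str, str, str], float] = {("I", "am", "happy"): 0.01}  # P(c|a,b)
FLOOR = 1e-8  # assumed probability for unseen events (not from the patent)


def log_p_sequence(tokens: Sequence[str]) -> float:
    """First value: log of P(a)P(b|a)P(c|a,b)P(d|b,c)..., as in claim 3."""
    total = math.log(UNIGRAM.get(tokens[0], FLOOR))
    if len(tokens) > 1:
        total += math.log(BIGRAM.get((tokens[0], tokens[1]), FLOOR))
    for i in range(2, len(tokens)):
        total += math.log(TRIGRAM.get(tuple(tokens[i - 2:i + 1]), FLOOR))
    return total


def log_q_random(tokens: Sequence[str]) -> float:
    """Second value: log of P(a)P(b)P(c)..., as in claim 4."""
    return sum(math.log(UNIGRAM.get(t, FLOOR)) for t in tokens)


def score(tokens: Sequence[str]) -> float:
    """Third value, taken here (by assumption) to be log(P / Q)."""
    return log_p_sequence(tokens) - log_q_random(tokens)


def is_suspicious(tokens: Sequence[str], threshold: float = 0.0) -> bool:
    """Flag the window when the score falls below the assumed threshold."""
    return score(tokens) < threshold


print(is_suspicious(["I", "am", "happy"]))
print(is_suspicious(["happy", "am", "I"]))
```

Working in log space avoids underflow when many small probabilities are multiplied; a plain ratio P/Q compared against 1.0 would express the same test.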

12. A computing device, comprising:
- a processing unit; and
- a system memory connected to the processing unit, the system memory including instructions that, when executed by the processing unit, cause the processing unit to implement an error detection module configured to detect a linguistic error within a sequence of words of a sentence, wherein the error detection module comprises:
  - a classification module configured to:
    - receive, from an application executing on the computing device, a plurality of tokens each forming a feature of a first sentence within an electronic document of the application;
    - process the plurality of tokens to form at least one partition having at least three sequential tokens of the plurality of tokens;
    - calculate a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence;
    - calculate a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence;
    - calculate a third numerical value based on a comparison of the first numerical value to the second numerical value;
    - compare the third numerical value to a first threshold value; and
    - evaluate the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from the first threshold value; and
  - a language model module configured as a data repository to store a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text, wherein numerical values used in calculating the first and second numerical values are retrieved from the language model module by the classification module.
- Dependent claims (13, 14, 15, 16, 17, 18, 19):
  - 13. The method of claim 12, wherein the plurality of tokens each correspond to a word of the first sentence.
  - 14. The computing device of claim 13, wherein the first numerical value is calculated using an equation having a form:
    - P(a,b,c,d,e,f, ...) = P(a)P(b|a)P(c|a,b)P(d|a,b,c)P(e|a,b,c,d)P(f|a,b,c,d,e) = P(a)P(b|a)P(c|a,b)P(d|b,c)P(e|c,d)P(f|d,e), wherein terms having the form P(a) correspond to a uni-gram probability, terms having the form P(b|a) correspond to a bi-gram probability, and terms having the form P(c|a,b) correspond to a tri-gram probability.
  - 15. The computing device of claim 13, wherein the second numerical value is calculated using an equation having a form:
    - Q(a,b,c,d,e,f, ...) = P(a)P(b)P(c)P(d)P(e)P(f), wherein terms having the form P(a) correspond to a uni-gram probability.
  - 16. The computing device of claim 13, wherein the classification module is further configured to evaluate the at least three sequential tokens as having a linguistic error upon the third numerical value deviating from the first threshold value and the second numerical value deviating from a second threshold value using a classification function having a form:
  - 17. The computing device of claim 12, wherein the classification module is further configured to emphasize at least one of the at least three sequential tokens within the electronic document upon determining that the at least three sequential tokens include a potential linguistic error.
  - 18. The computing device of claim 17, wherein emphasizing at least one of the at least three sequential tokens includes generating at least one user perceivable cue selected from a group including:
    - a visual cue;
    - an audio cue; and
    - a tactile cue.
  - 19. The computing device of claim 12, wherein the classification module is further configured to receive a user specification defining a number of sequential tokens of the at least one partition.
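
The language model module of claim 12 stores conditional probabilities for single words and word sequences derived from a corpus. One way such a repository might be populated is sketched below using unsmoothed maximum-likelihood counts. The class name `NGramModel`, its method names, the three-sentence toy corpus, and the absence of smoothing are all assumptions for illustration; the claims do not say how the probabilities are estimated.

```python
from collections import Counter
from typing import Iterable, List


class NGramModel:
    """Hypothetical repository of uni-, bi- and tri-gram probabilities."""

    def __init__(self, sentences: Iterable[List[str]]) -> None:
        self.uni: Counter = Counter()
        self.bi: Counter = Counter()
        self.tri: Counter = Counter()
        self.total = 0  # number of tokens seen in the corpus
        for s in sentences:
            self.total += len(s)
            self.uni.update(s)
            self.bi.update(zip(s, s[1:]))
            self.tri.update(zip(s, s[1:], s[2:]))

    def p_unigram(self, a: str) -> float:
        """P(a): relative frequency of the word in the corpus."""
        return self.uni[a] / self.total if self.total else 0.0

    def p_bigram(self, a: str, b: str) -> float:
        """P(b|a): count(a, b) / count(a), or 0.0 when `a` is unseen."""
        return self.bi[(a, b)] / self.uni[a] if self.uni[a] else 0.0

    def p_trigram(self, a: str, b: str, c: str) -> float:
        """P(c|a,b): count(a, b, c) / count(a, b), or 0.0 when (a, b) is unseen."""
        return self.tri[(a, b, c)] / self.bi[(a, b)] if self.bi[(a, b)] else 0.0


# Toy corpus, assumed to be already tokenized into words.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = NGramModel(corpus)
print(model.p_unigram("the"), model.p_bigram("the", "cat"), model.p_trigram("the", "cat", "sat"))
```

Unseen events get probability 0.0 in this sketch, so a classification module built on it would need some fallback (smoothing or a probability floor) before taking logarithms or ratios.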

20. A computer readable storage device having computer-executable instructions that, when executed by a computing device, cause the computing device to perform steps comprising:
- receiving a command configured to instantiate linguistic error detection within at least three sequential tokens of a plurality of tokens each forming a word of a first sentence within an electronic document;
- parsing the plurality of tokens to form at least one partition having the at least three sequential tokens;
- calculating a first numerical value defined as a probability that the at least three sequential tokens will occur in sequence in a sentence, wherein the first numerical value is calculated using a first equation having a form:

  P(a,b,c,d,e,f, ...) = P(a)P(b|a)P(c|a,b)P(d|a,b,c)P(e|a,b,c,d)P(f|a,b,c,d,e) = P(a)P(b|a)P(c|a,b)P(d|b,c)P(e|c,d)P(f|d,e), wherein terms having the form P(a) within the first equation correspond to a uni-gram probability, terms having the form P(b|a) within the first equation correspond to a bi-gram probability, and terms having the form P(c|a,b) within the first equation correspond to a tri-gram probability;
- calculating a second numerical value defined as a probability that the at least three sequential tokens will randomly occur in a sentence, wherein the second numerical value is calculated using a second equation having a form:

  Q(a,b,c,d,e,f, ...) = P(a)P(b)P(c)P(d)P(e)P(f), wherein terms having the form P(a) within the second equation correspond to a uni-gram probability, and wherein numerical values used in calculating the first and second numerical values are retrieved from a statistical N-gram language model formed of a structured set of conditional probabilities assigned to single words and specific sequences of words as derived from a corpus of language text;
- calculating a third numerical value based on a comparison of the first numerical value to the second numerical value;
- evaluating the at least three sequential tokens as having a potential linguistic error upon the third numerical value deviating from a first threshold value and the second numerical value deviating from a second threshold value using a classification function having a form:

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Cai, Yizheng; Powell, Kevin Roland; Shahani, Ravi Chandru; Wang, Lei
Primary Examiner(s): SPOONER, LAMONT M
Application Number: US13/193,248
Publication Number: US 20130030793A1
Time in Patent Office: 1,167 Days
Field of Search: 704/1, 704/2, 704/8, 704/9, 704/10
US Class Current: 704/9
CPC Class Codes:
G06F 40/232   Orthographic correction, e....
G06F 40/253   Grammatical analysis; Style...
G06F 40/263   Language identification
G06F 40/40   Processing or translation o...
