Linguistic error detection

US 9,836,447 B2
Filed: 09/16/2014
Issued: 12/05/2017
Est. Priority Date: 07/28/2011
Status: Active Grant

Abstract

Potential linguistic errors within a sequence of words of a sentence are identified based on analysis of a configurable sliding window. The analysis is performed based on an assumption that if a sequence of words occurs frequently enough within a large, well-formed corpus, its joint probability for occurring in a sentence is very likely to be greater than the same words randomly ordered.
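
In symbols, the assumption stated above might be written as follows. The notation is illustrative only and does not appear in the source: $P$ stands for the joint probability assigned by the language model, $w_1, \dots, w_n$ for the words in the window, and $\pi$ for a random permutation of their positions.

```latex
P(w_1, w_2, \dots, w_n) \;>\; P(w_{\pi(1)}, w_{\pi(2)}, \dots, w_{\pi(n)})
```

A window for which this inequality fails is a candidate for a linguistic error, subject to the thresholds recited in the claims.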

20 Claims

1. A method for using a computing device to detect linguistic errors, comprising:
- selecting a sequence of three or more words in a phrase;
- applying a statistical language model to the selected sequence of three or more words to determine a first probability of occurrence of the selected sequence of three or more words in the phrase;
- applying the statistical language model to determine a second probability of occurrence of a random ordering of the words in the selected sequence of three or more words;
- calculating a numerical value by comparing the first probability of occurrence to the second probability of occurrence; and
- determining that the phrase contains a linguistic error when the calculated numerical value deviates from a first predetermined threshold and the second probability of occurrence deviates from a second predetermined threshold.

2. The method of claim 1 wherein the statistical language model is an N-gram language model formed of a structured set of conditional probabilities assigned to single words and to specific sequences of words as derived from a corpus of language text.

3. The method of claim 1 wherein determining the first probability of occurrence further comprises:
- evaluating a first word of the sequence of three or more words as a uni-gram probability;
- evaluating a second word of the sequence of three or more words as a bi-gram probability that assumes the first word immediately precedes the second word in the sequence of three or more words; and
- evaluating a third word of the sequence of three or more words as a tri-gram probability that assumes the first and second words occur in sequence and immediately precede the third word in the sequence of three or more words.

4. The method of claim 1 wherein the sequence of three or more words in the phrase is received from an electronic document provided via an input device.

5. The method of claim 4 further comprising emphasizing the sequence of three or more words within the electronic document upon determining that the phrase contains a linguistic error.

6. The method of claim 4 wherein the electronic document is provided via a voice input device.

7. The method of claim 4 wherein selecting the sequence of three or more words in the phrase of the electronic document is performed via a configurable sliding window that defines a number of words that will be processed to determine whether the phrase contains a linguistic error.

8. The method of claim 1, further comprising receiving user specification defining the sequence of three or more words in the phrase.

9. The method of claim 1 wherein calculating the numerical value comprises using at least another numerical value stored in the statistical language model.
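
The probability steps recited in claims 1 through 3 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the toy corpus, the unsmoothed relative-frequency estimates, the function names (`train`, `sequence_probability`, `ordering_ratio`), the `floor` guard against division by zero, and the choice of a ratio as the "numerical value". The claims do not say how the two probabilities are compared.

```python
import random
from collections import Counter


def train(sentences):
    """Count uni-, bi-, and tri-grams in a tokenized toy corpus."""
    counts = Counter()
    total = 0
    for words in sentences:
        total += len(words)
        for n in (1, 2, 3):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts, total


def sequence_probability(words, counts, total):
    """Chain-rule probability of a three-word sequence, as in claim 3."""
    w1, w2, w3 = words
    # First word: uni-gram probability.
    p1 = counts[(w1,)] / total
    # Second word: bi-gram probability given the first word.
    p2 = counts[(w1, w2)] / counts[(w1,)] if counts[(w1,)] else 0.0
    # Third word: tri-gram probability given the first two words.
    p3 = counts[(w1, w2, w3)] / counts[(w1, w2)] if counts[(w1, w2)] else 0.0
    return p1 * p2 * p3


def ordering_ratio(words, counts, total, rng, floor=1e-9):
    """Return (first / second, second) for one random reordering.

    The ratio is an assumed way of "comparing" the two probabilities;
    `floor` is a hypothetical guard that avoids dividing by zero.
    """
    first = sequence_probability(words, counts, total)
    shuffled = list(words)
    rng.shuffle(shuffled)  # may happen to return the original order
    second = sequence_probability(shuffled, counts, total)
    return max(first, floor) / max(second, floor), second


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
    counts, total = train(corpus)
    print(sequence_probability(["the", "cat", "sat"], counts, total))
    print(ordering_ratio(["the", "cat", "sat"], counts, total, random.Random(0)))
```

A production model would use a large corpus and smoothing so that unseen sequences do not collapse to zero probability; the sketch omits both to keep the chain-rule structure visible.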

10. A computing device, comprising:
- a processing unit; and
- a system memory connected to the processing unit, the system memory including instructions that, when executed by the processing unit, cause the processing unit to implement an error detection module configured to detect a linguistic error, wherein the error detection module comprises:
  - a classification module configured to:
    - receive, from an application executing on the computing device, a sequence of three or more words within an electronic document of the application;
    - apply a statistical language model to the selected sequence of three or more words to determine a first probability of occurrence of the selected sequence of three or more words;
    - apply the statistical language model to determine a second probability of occurrence of a random ordering of the words in the selected sequence of three or more words;
    - calculate a numerical value by comparing the first probability of occurrence to the second probability of occurrence; and
    - determine that the sequence of three or more words contains a linguistic error when the calculated numerical value deviates from a first predetermined threshold and the second probability of occurrence deviates from a second predetermined threshold.

11. The computing device of claim 10 wherein the statistical language model is an N-gram language model formed of a structured set of conditional probabilities assigned to single words and to specific sequences of words as derived from a corpus of language text.

12. The computing device of claim 10 further comprising highlighting the sequence of three or more words within the electronic document when it is determined that the sequence of three or more words contains a linguistic error.

13. The computing device of claim 10 further comprising providing a tactile cue when it is determined that the sequence of three or more words contains a linguistic error.

14. The computing device of claim 10 further comprising providing an audio cue when it is determined that the sequence of three or more words contains a linguistic error.

15. The computing device of claim 10 wherein the electronic document is provided via a voice input device that captures spoken words.

16. The computing device of claim 10 wherein the sequence of three or more words is selected via a configurable sliding window over the electronic document.

17. A computer readable hardware storage device having computer-executable instructions that, when executed by a computing device, cause the computing device to perform steps comprising:
- sequentially selecting sequences of three or more words within an electronic document via a configurable sliding window;
- applying a statistical language model to each selected sequence of three or more words to determine a first probability of occurrence of the selected sequence of three or more words;
- applying the statistical language model to determine a second probability of occurrence of a random ordering of the words in each selected sequence of three or more words;
- calculating a numerical value by comparing the first probability of occurrence to the second probability of occurrence; and
- determining that any of the sequences of three or more words contains a linguistic error when the calculated numerical value is less than a first predetermined threshold and the second probability of occurrence is greater than a second predetermined threshold.

18. The computer readable hardware storage device of claim 17 wherein the statistical language model is an N-gram language model formed of a structured set of conditional probabilities assigned to single words and to specific sequences of words as derived from a corpus of language text.

19. The computer readable hardware storage device of claim 17 further comprising steps for emphasizing each sequence of three or more words that is determined to contain a linguistic error.

20. The computer readable hardware storage device of claim 17 further comprising steps for receiving the electronic document via a voice input device that captures spoken words.
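
The sliding-window loop and the two threshold tests recited in claim 17 might be sketched in Python as below. This is a hedged illustration, not the patented implementation: `prob` is a hypothetical callable standing in for the statistical language model, the ratio of the two probabilities is an assumed form of the "numerical value", and the default values of `size`, `ratio_threshold`, and `reorder_threshold` are arbitrary placeholders.

```python
import random


def windows(words, size=3):
    """Yield (start, window) pairs as a fixed-size window slides one word at a time."""
    for start in range(len(words) - size + 1):
        yield start, words[start:start + size]


def flag_errors(words, prob, size=3, ratio_threshold=1.0,
                reorder_threshold=1e-6, seed=0):
    """Return the windows that satisfy both threshold tests of claim 17.

    `prob` is a hypothetical scoring function: it takes a list of words
    and returns the model's probability for that exact ordering.
    """
    rng = random.Random(seed)
    flagged = []
    for start, window in windows(words, size):
        first = prob(window)        # probability of the words as written
        shuffled = window[:]
        rng.shuffle(shuffled)       # a random ordering of the same words
        second = prob(shuffled)     # probability of that random ordering
        # Second test: the reordered words must themselves be likely enough.
        if second <= reorder_threshold:
            continue
        # First test: the written order scores below the threshold
        # relative to the random order.
        if first / second < ratio_threshold:
            flagged.append((start, window))
    return flagged


if __name__ == "__main__":
    # Toy model: one ordering is common, every other ordering is less so.
    def toy_prob(ws):
        return 0.5 if ws == ["the", "cat", "sat"] else 0.1

    print(flag_errors(["sat", "the", "cat"], toy_prob, ratio_threshold=1.5))
```

Skipping windows whose reordered probability is at or below `reorder_threshold` mirrors the second condition in the claim: when the model knows little about the words in any order, a low ratio carries little evidence.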

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Cai, Yizheng; Powell, Kevin Roland; Shahani, Ravi Chandru; Wang, Lei
Primary Examiner(s): SPOONER, LAMONT M
Application Number: US14/488,059
Publication Number: US 20150006159A1
Time in Patent Office: 1,176 Days
Field of Search: 704/1, 704/9, 704/10
CPC Class Codes:
- G06F 40/232   Orthographic correction, e....
- G06F 40/253   Grammatical analysis; Style...
- G06F 40/263   Language identification
- G06F 40/40   Processing or translation o...
