Deep learning approach to grammatical correction for incomplete parses

US 10,740,555 B2
Filed: 12/07/2017
Issued: 08/11/2020
Est. Priority Date: 12/07/2017
Status: Active Grant

Abstract

Performing an operation comprising determining that a parse of an input string comprising a plurality of tokens is incomplete, generating, based on a machine learning (ML) model: (i) a plurality of candidate addition tokens for adding to the input string, and (ii) a plurality of candidate removal tokens for removing from the input string, selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a first candidate token, and modifying the input string based on the first candidate token to facilitate a complete parse of the modified input string by a parser.
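The candidate-generation step in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical model interface of two callables, `suggest_insert` and `removal_confidence`, backed here by small lookup tables rather than a trained network; the `Candidate` record, the table contents, and all function names are illustrative and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    kind: str          # "add" or "remove"
    index: int         # insertion point, or index of the token to remove
    token: str
    confidence: float


# Hypothetical stand-ins for a trained ML model (assumed values).
INSERT_TABLE = {("want", "go"): ("to", 0.9)}
REMOVE_TABLE = {("to", "the", "the"): 0.7}


def toy_suggest_insert(first: str, second: str) -> Optional[Tuple[str, float]]:
    """Propose a token to insert between two adjacent tokens, if any."""
    return INSERT_TABLE.get((first, second))


def toy_removal_confidence(first: str, second: str, third: str) -> float:
    """Confidence that `second` should be removed from between its neighbors."""
    return REMOVE_TABLE.get((first, second, third), 0.0)


def generate_candidates(
    tokens: List[str],
    suggest_insert: Callable[[str, str], Optional[Tuple[str, float]]],
    removal_confidence: Callable[[str, str, str], float],
) -> Tuple[List[Candidate], List[Candidate]]:
    adds: List[Candidate] = []
    removes: List[Candidate] = []
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        # First and second tokens: a possible new token between them.
        suggestion = suggest_insert(first, second)
        if suggestion is not None:
            new_token, conf = suggestion
            adds.append(Candidate("add", i + 1, new_token, conf))
        # First and third tokens: should the second token be removed?
        if i + 2 < len(tokens):
            third = tokens[i + 2]
            conf = removal_confidence(first, second, third)
            if conf > 0.0:
                removes.append(Candidate("remove", i + 1, second, conf))
    return adds, removes


if __name__ == "__main__":
    sentence = "I want go to the the store".split()
    print(generate_candidates(sentence, toy_suggest_insert, toy_removal_confidence))
```

Each candidate keeps the position it refers to, so a later selection step can apply exactly one edit to the original token list.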

25 Citations

20 Claims

1. A method, comprising:
- determining, by a parser executing on a processor, that a parse of an input string comprising a plurality of tokens is incomplete;
  
  generating, based on a machine learning (ML) model: (i) a plurality of candidate addition tokens for adding to the input string, and (ii) a plurality of candidate removal tokens for removing from the input string, comprising, for a first token of the plurality of tokens:
  
  identifying a second token of the plurality of tokens, wherein the second token is immediately subsequent to the first token in the input string;
  
  processing the first and second tokens using the ML model to generate a potential new token to be inserted between the first and second tokens without removing either the first or second token from the input string;
  
  identifying a third token of the plurality of tokens, wherein the third token is immediately subsequent to the second token in the input string; and
  
  processing the first and third tokens using the ML model to generate a potential removal token indicating a confidence that the second token should be removed from the input string;
  
  selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a first candidate token; and
  
  modifying the input string based on the first candidate token to facilitate a complete parse of the modified input string by the parser.
  - 2. The method of claim 1, wherein the first candidate token comprises a first candidate addition token of the plurality of candidate addition tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate addition token; and
      
      inserting the first candidate addition token between the first token and the second token in the input string.
  - 3. The method of claim 1, wherein the first candidate token comprises a first candidate removal token of the plurality of candidate removal tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate removal token; and
      
      removing the first candidate removal token from the input string, wherein the first candidate removal token is disposed between the first and second tokens in the unmodified input string.
  - 4. The method of claim 1, wherein each of the plurality of candidate addition tokens is associated with a respective confidence score that exceeds an addition threshold, wherein each of the plurality of candidate removal tokens is associated with a respective confidence score that exceeds a removal threshold.
  - 5. The method of claim 1, further comprising:
    - determining that the parser can parse the modified input string; and
      
      parsing the modified input string by the parser to generate a parse tree representing the modified input string.
  - 6. The method of claim 1, further comprising:
    - determining that the parser cannot parse the modified input string;
      
      selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a second candidate token;
      
      modifying the input string based on the second candidate token; and
      
      determining that the parser can parse the input string modified based on the second candidate token.
  - 7. The method of claim 1, further comprising prior to determining the parse of the input string is incomplete:
    - receiving a training corpus comprising a plurality of strings;
      
      training the ML model based on the training corpus, wherein the ML model learns, for each respective token in the training corpus, a respective plurality of weights relative to each respective pair of tokens in the training corpus, wherein each of the plurality of weights indicates either (i) a confidence that the respective token should be inserted between the respective pair of tokens, or (ii) a confidence that the respective token should be removed from between the respective pair of tokens.
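
The selection and retry behavior described in claims 2 through 6 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: candidates are plain tuples, `can_parse` is a hypothetical predicate standing in for the parser, the default thresholds are arbitrary, and each fallback candidate is applied to the original input string (the claims do not settle whether a second candidate is applied to the original or to the already modified string).

```python
from typing import Callable, List, Optional, Tuple

# (kind, index, token, confidence); kind is "add" or "remove".
Candidate = Tuple[str, int, str, float]


def apply_candidate(tokens: List[str], cand: Candidate) -> List[str]:
    """Return a new token list with one insertion or one removal applied."""
    kind, index, token, _ = cand
    if kind == "add":
        return tokens[:index] + [token] + tokens[index:]
    return tokens[:index] + tokens[index + 1:]


def repair(
    tokens: List[str],
    adds: List[Candidate],
    removes: List[Candidate],
    can_parse: Callable[[List[str]], bool],
    add_threshold: float = 0.5,      # assumed value
    remove_threshold: float = 0.5,   # assumed value
) -> Optional[List[str]]:
    # Keep only candidates whose confidence exceeds the matching threshold.
    pool = [c for c in adds if c[3] > add_threshold]
    pool += [c for c in removes if c[3] > remove_threshold]
    # Try the most confident candidate first, then fall back to the next one.
    for cand in sorted(pool, key=lambda c: c[3], reverse=True):
        modified = apply_candidate(tokens, cand)
        if can_parse(modified):
            return modified
    return None  # no single edit produced a parseable string
```

Returning `None` when every candidate fails leaves the caller free to decide what to do with an input that no single edit can repair.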

8. A computer program product, comprising:
- a non-transitory computer-readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor to perform an operation comprising:
  
  determining that a parse of an input string comprising a plurality of tokens is incomplete;
  
  generating, based on a machine learning (ML) model: (i) a plurality of candidate addition tokens for adding to the input string, and (ii) a plurality of candidate removal tokens for removing from the input string comprising, for a first token of the plurality of tokens:
  
  identifying a second token of the plurality of tokens, wherein the second token is immediately subsequent to the first token in the input string;
  
  processing the first and second tokens using the ML model to generate a potential new token to be inserted between the first and second tokens without removing either the first or second token from the input string;
  
  identifying a third token of the plurality of tokens, wherein the third token is immediately subsequent to the second token in the input string; and
  
  processing the first and third tokens using the ML model to generate a potential removal token indicating a confidence that the second token should be removed from the input string;
  
  selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a first candidate token; and
  
  modifying the input string based on the first candidate token to facilitate a complete parse of the modified input string by a parser.
  - 9. The computer program product of claim 8, wherein the first candidate token comprises a first candidate addition token of the plurality of candidate addition tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate addition token; and
      
      inserting the first candidate addition token between the first token and the second token in the input string.
  - 10. The computer program product of claim 8, wherein the first candidate token comprises a first candidate removal token of the plurality of candidate removal tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate removal token; and
      
      removing the first candidate removal token from the input string, wherein the first candidate removal token is disposed between the first and second tokens in the unmodified input string.
  - 11. The computer program product of claim 8, wherein each of the plurality of candidate addition tokens is associated with a respective confidence score that exceeds an addition threshold, wherein each of the plurality of candidate removal tokens is associated with a respective confidence score that exceeds a removal threshold.
  - 12. The computer program product of claim 8, the operation further comprising:
    - determining that the parser can parse the modified input string; and
      
      parsing the modified input string by the parser to generate a parse tree representing the modified input string.
  - 13. The computer program product of claim 8, the operation further comprising:
    - determining that the parser cannot parse the modified input string;
      
      selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a second candidate token;
      
      modifying the input string based on the second candidate token; and
      
      determining that the parser can parse the input string modified based on the second candidate token.
  - 14. The computer program product of claim 8, the operation further comprising prior to determining the parse of the input string is incomplete:
    - receiving a training corpus comprising a plurality of strings;
      
      training the ML model based on the training corpus, wherein the ML model learns, for each respective token in the training corpus, a respective plurality of weights relative to each respective pair of tokens in the training corpus, wherein each of the plurality of weights indicates either (i) a confidence that the respective token should be inserted between the respective pair of tokens, or (ii) a confidence that the respective token should be removed from between the respective pair of tokens.
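
The per-pair weights described in the training claim can be illustrated with a simple count-based estimator. This deliberately swaps the deep learning model for n-gram counts so the idea fits in a few lines; the function names, the whitespace tokenization, and the two ratio formulas are assumptions made for illustration, not the training procedure of the patent.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_counts(corpus: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count adjacent pairs and triples over whitespace-tokenized strings."""
    bigrams: Counter = Counter()
    trigrams: Counter = Counter()
    for sentence in corpus:
        toks = sentence.split()
        bigrams.update(zip(toks, toks[1:]))
        trigrams.update(zip(toks, toks[1:], toks[2:]))
    return bigrams, trigrams


def insert_confidence(bigrams: Counter, trigrams: Counter,
                      left: str, token: str, right: str) -> float:
    """Share of observed (left, ?, right) contexts where `token` sits in between."""
    with_token = trigrams[(left, token, right)]
    gapped = sum(n for (a, _, c), n in trigrams.items()
                 if a == left and c == right)
    total = gapped + bigrams[(left, right)]
    return with_token / total if total else 0.0


def removal_confidence(bigrams: Counter, trigrams: Counter,
                       left: str, token: str, right: str) -> float:
    """Share of contexts where left and right are adjacent, with no `token` between."""
    adjacent = bigrams[(left, right)]
    total = adjacent + trigrams[(left, token, right)]
    return adjacent / total if total else 0.0
```

A trained network would generalize to pairs it has never seen, whereas this estimator returns 0.0 for any context absent from the corpus.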

15. A system, comprising:
- a processor; and
  
  a memory storing one or more instructions which, when executed by the processor, performs an operation comprising:
  
  determining that a parse of an input string comprising a plurality of tokens is incomplete;
  
  generating, based on a machine learning (ML) model: (i) a plurality of candidate addition tokens for adding to the input string, and (ii) a plurality of candidate removal tokens for removing from the input string comprising, for a first token of the plurality of tokens:
  
  identifying a second token of the plurality of tokens, wherein the second token is immediately subsequent to the first token in the input string;
  
  processing the first and second tokens using the ML model to generate a potential new token to be inserted between the first and second tokens without removing either the first or second token from the input string;
  
  identifying a third token of the plurality of tokens, wherein the third token is immediately subsequent to the second token in the input string; and
  
  processing the first and third tokens using the ML model to generate a potential removal token indicating a confidence that the second token should be removed from the input string;
  
  selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a first candidate token; and
  
  modifying the input string based on the first candidate token to facilitate a complete parse of the modified input string by a parser.
  - 16. The system of claim 15, wherein the first candidate token comprises a first candidate addition token of the plurality of candidate addition tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate addition token; and
      
      inserting the first candidate addition token between the first token and the second token in the input string.
  - 17. The system of claim 15, wherein the first candidate token comprises a first candidate removal token of the plurality of candidate removal tokens, wherein the first candidate token is selected based on a confidence score for the first candidate token exceeding a respective confidence score for each of the plurality of candidate addition tokens and each of the plurality of candidate removal tokens, wherein modifying the input string comprises:
    - identifying a first and a second token of the plurality of tokens associated with the first candidate removal token; and
      
      removing the first candidate removal token from the input string, wherein the first candidate removal token is disposed between the first and second tokens in the unmodified input string.
  - 18. The system of claim 15, wherein each of the plurality of candidate addition tokens is associated with a respective confidence score that exceeds an addition threshold, wherein each of the plurality of candidate removal tokens is associated with a respective confidence score that exceeds a removal threshold.
  - 19. The system of claim 15, the operation further comprising:
    - determining that the parser can parse the modified input string; and
      
      parsing the modified input string by the parser to generate a parse tree representing the modified input string.
  - 20. The system of claim 15, the operation further comprising:
    - determining that the parser cannot parse the modified input string;
      
      selecting, from the plurality of candidate addition tokens and the plurality of candidate removal tokens, a second candidate token;
      
      modifying the input string based on the second candidate token; and
      
      determining that the parser can parse the input string modified based on the second candidate token.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Ezen Can, Aysu; Delima, Roberto; Contreras, David; Allen, Corville O.
Primary Examiner(s): Shin, Seong-Ah A
Application Number: US15/834,640
Publication Number: US 20190179887A1
Time in Patent Office: 978 Days
Field of Search: 704/9

CPC Class Codes
- G06F 40/205   Parsing
- G06F 40/211   Syntactic parsing, e.g. bas...
- G06F 40/216   using statistical methods
- G06F 40/253   Grammatical analysis; Style...
- G06F 40/284   Lexical analysis, e.g. toke...
- G06N 20/00   Machine learning
- G06N 7/00   Computing arrangements base...
