Lean parsing: a natural language processing system and method for parsing domain-specific languages

US 10,579,721 B2
Filed: 09/22/2017
Issued: 03/03/2020
Est. Priority Date: 07/15/2016
Status: Active Grant

Abstract

A method and system parses natural language in a unique way, determining important words pertaining to a text corpus of a particular genre, such as tax preparation. Sentences extracted from instructions or forms pertaining to tax preparation, for example, are parsed to determine word groups forming various parts of speech, and then are processed to exclude words on an exclusion list and word groups that don't meet predetermined criteria. From the resulting data, synonyms are replaced with a common functional operator and the resulting sentence text is analyzed against predetermined patterns to determine one or more functions to be used in a document preparation system.
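
The flow summarized in the abstract (drop excluded words, replace synonyms with a common operator, then match the result against a predetermined pattern) can be illustrated with a minimal Python sketch. The exclusion list, the synonym table, the single pattern, and the function names `normalize` and `match_function` are all assumptions made for illustration; the abstract does not specify them, and the part-of-speech grouping step is omitted here.

```python
import re

# All three tables below are illustrative assumptions, not taken from the patent.
EXCLUSION_LIST = {"the", "amount", "on", "with"}
OPERATOR_SYNONYMS = {"combine": "add", "sum": "add", "plus": "add",
                     "deduct": "subtract", "minus": "subtract"}
# One assumed predetermined pattern: a binary operator followed by two line references.
BINARY_LINE_PATTERN = re.compile(r"^(add|subtract) line (\d+) line (\d+)$")


def normalize(sentence):
    """Tokenize, drop excluded words, and replace synonyms with a common operator."""
    tokens = re.findall(r"[a-z]+|\d+", sentence.lower())
    kept = [t for t in tokens if t not in EXCLUSION_LIST]
    return [OPERATOR_SYNONYMS.get(t, t) for t in kept]


def match_function(sentence):
    """Return (operator, line_a, line_b) if the normalized text fits the pattern, else None."""
    match = BINARY_LINE_PATTERN.match(" ".join(normalize(sentence)))
    if match is None:
        return None
    operator, line_a, line_b = match.groups()
    return (operator, int(line_a), int(line_b))


print(match_function("Combine the amount on line 1 with the amount on line 2."))
# prints ('add', 1, 2)
```

A sentence that does not reduce to the assumed pattern, such as "See the instructions.", yields `None` in this sketch.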

78 Citations

28 Claims

1. A computer implemented method, comprising:
- receiving electronic textual data relating to a form for which one or more machine-executable form field functions needs to be determined, the electronic textual data including natural language instructions relating to the determination of one or more form field values of the form;
  
  analyzing the electronic textual data to determine sentence data representing a plurality of separate sentences of the electronic textual data;
  
  separating the electronic textual data into a data array formed of the sentence data of the determined plurality of separate sentences;
  
  for each given sentence of sentence data representing sentences in the data array;
  
  isolating segment data of one or more segments of the sentence data while relating each resulting segment to prior and succeeding segments of the sentence data, further storing the isolated segment data in one or more segment data memory locations organized to retain structure and relation of one segment to another;
  
  for each segment of the segment data;
  
  classifying segment data of each segment as being of a segment type of a plurality of possible segment types, discarding segment data classified as being of one or more particular predetermined segment types; and
  
  parsing each segment data according to one or more predetermined lexicons and determining whether the segment contains one or more operators, an operator being a natural language token representing an operation that may be performed on data;
  
  upon determining that the segment data representing the segment contains operator data representing one or more operators;
  
  identifying all operators in the segment data representing the segment;
  
  identifying dependency data representing one or more dependencies of the segment data associated with each identified operator;
  
  discarding any tokens not identified as either an operator or a dependency; and
  
  applying one or more operator-specific rules to each identified operator of the segment data to determine a first predicate structure equivalent to the original natural language text of the segment; and
  
  upon determining that the segment data representing the segment does not contain operator data representing one or more operators;
  
  identifying each single or multiword token in the segment data that is a predetermined token of the domain;
  
  determining any remaining tokens of the segment that are not predetermined tokens of the domain and map the identified tokens and the remaining tokens to one or more predetermined rules, resulting in a first predicate structure for the segment data of the segment being analyzed;
  
  mapping one or more of the first predicate structures to one or more predetermined machine-executable functions; and
  
  implementing at least one of the mapped machine-executable functions in an electronic document preparation system.
  - 2. The computer implemented method of claim 1, further comprising:
    - determining one or more single or multiword tokens in a segment that are a subset of a longer multiword token; and
      
      eliminating from consideration, for this segment, the determined single or multiword tokens.
  - 3. The computer implemented method of claim 1, further comprising:
    - removing, from a segment being processed, one or more tokens of the segment from further consideration due to the one or more tokens appearing in exclusion data representing an exclusion list.
  - 4. The computer implemented method of claim 3, wherein the exclusion list includes words known to have little or no importance in the domain of the data array.
  - 5. The computer implemented method of claim 1, wherein an operator includes a natural language token specified as or is a synonym of one of a set of operators comprising:
    - add;
      
      subtract;
      
      multiply;
      
      divide;
      
      less than;
      
      greater than;
      
      and;
      
      or;
      
      equal to; and
      
      not equal to.
  - 6. The computer implemented method of claim 1, wherein the segment types include at least one segment type selected from the group of segment types consisting of:
    - description;
      
      amount;
      
      instructions;
      
      condition;
      
      date;
      
      person status; and
      
      true calc.
  - 7. The computer implemented method of claim 1, further comprising:
    - filtering the sentence data to keep only tokens meeting at least one of a plurality of token tests, resulting in filtered token data.
  - 8. The computer implemented method of claim 7, wherein filtering the sentence data to keep only tokens meeting at least one of a plurality of token tests further comprises:
    - determining a part of speech of the token; and
      
      filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data.
  - 9. The computer implemented method of claim 8, wherein filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data further comprises:
    - keeping a token of the sentence data if a part of speech associated with the token is a noun.
  - 10. The computer implemented method of claim 8, wherein filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data further comprises:
    - keeping a token of the sentence data if a part of speech associated with the token is a verb.
  - 11. The computer implemented method of claim 1, further comprising:
    - predetermining a lexicon by;
      
      determining a frequency of appearance of one or more tokens with respect to all other tokens in a text corpus of the domain;
      
      discarding from the text corpus any tokens having a frequency of appearance below a predetermined threshold frequency, resulting in remaining tokens;
      
      examining each remaining token of the sentence data in a context of a sentence of the sentence data in which the token appears; and
      
      identifying a part of speech associated with the use of the remaining token in the context of its use within segments in which it appears.
  - 12. The computer implemented method of claim 1, further comprising:
    - replacing, within one or more sentences of the data array, tokens having similar meanings with a single word synonym of the original token.
  - 13. The computer implemented method of claim 1, further comprising:
    - generating, for a first data field of at least a first sentence of the data array, dependency data indicating one or more dependencies, wherein the dependencies include one or more of;
      
      a data value of a second data field from a second sentence of the data array;
      
      a data value of a first data field associated with a sentence of a form other than a form associated with the first data field; and
      
      a constant.
  - 14. The computer implemented method of claim 1, wherein the sentence data of the data array is associated with a new or updated tax form.
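
As a rough illustration of the steps recited in claim 1, the following Python sketch splits text into sentences and segments, discards one segment type, and then follows either the operator branch or the no-operator branch to produce a predicate structure that maps to an executable function. Every name here (`Segment`, `classify`, `to_predicate`, `parse`, `execute`), the tiny operator lexicon, the regex-based splitting, the toy classifier, and the `copy` rule are assumptions for illustration only; the claim does not prescribe any of them, and the sketch handles just the first operator in a segment.

```python
import re
from dataclasses import dataclass

# Assumed lexicon mapping surface tokens to canonical operators.
OPERATOR_LEXICON = {"add": "add", "plus": "add", "subtract": "subtract", "minus": "subtract"}
# Assumed machine-executable functions, keyed by predicate name.
FUNCTIONS = {
    "add": lambda *values: sum(values),
    "subtract": lambda a, b: a - b,
    "copy": lambda a: a,
}
DISCARDED_TYPES = {"description"}  # assumed choice of segment types to discard


@dataclass
class Segment:
    text: str
    sentence_index: int  # which sentence of the data array the segment came from
    position: int        # order within the sentence, so neighbors stay reachable


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_segments(sentence, sentence_index):
    parts = [p.strip(" .") for p in re.split(r"[;,]", sentence)]
    parts = [p for p in parts if p]
    return [Segment(p, sentence_index, i) for i, p in enumerate(parts)]


def classify(segment):
    # Toy classifier: a segment that references a form line is a calculation.
    return "true calc" if re.search(r"\bline \d+", segment.text.lower()) else "description"


def to_predicate(segment):
    tokens = re.findall(r"line \d+|[a-z]+|\d+", segment.text.lower())
    operators = [OPERATOR_LEXICON[t] for t in tokens if t in OPERATOR_LEXICON]
    dependencies = [t for t in tokens if t.startswith("line ")]
    if operators:
        # Operator branch: everything except operators and dependencies is dropped.
        # Only the first operator is handled in this sketch.
        return (operators[0], *dependencies)
    # No-operator branch: a single assumed rule that copies a referenced line.
    return ("copy", dependencies[0]) if dependencies else None


def parse(text):
    predicates = []
    for sentence_index, sentence in enumerate(split_sentences(text)):
        for segment in split_segments(sentence, sentence_index):
            if classify(segment) in DISCARDED_TYPES:
                continue
            predicate = to_predicate(segment)
            if predicate is not None:
                predicates.append(predicate)
    return predicates


def execute(predicate, field_values):
    name, *dependencies = predicate
    return FUNCTIONS[name](*(field_values[d] for d in dependencies))
```

With these assumptions, `parse("This section covers wages. Add line 1 to line 2. Enter the amount from line 3, if any.")` returns `[("add", "line 1", "line 2"), ("copy", "line 3")]`, and `execute(("add", "line 1", "line 2"), {"line 1": 100, "line 2": 50})` returns `150`. A production system would replace the regex heuristics with the predetermined lexicons and operator-specific rules the claim refers to.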

15. A system, comprising:
- one or more processors;
  
  at least one memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising;
  
  receiving electronic textual data relating to a form for which one or more machine-executable form field functions needs to be determined, the electronic textual data including natural language instructions relating to the determination of one or more form field values of the form;
  
  analyzing the electronic textual data to determine sentence data representing a plurality of separate sentences of the electronic textual data;
  
  separating the electronic textual data into a data array formed of the sentence data of the determined plurality of separate sentences;
  
  for each given sentence of sentence data representing sentences in the data array;
  
  isolating segment data of one or more segments of the sentence data while relating each resulting segment to prior and succeeding segments of the sentence data, further storing the isolated segment data in one or more segment data memory locations organized to retain structure and relation of one segment to another;
  
  for each segment of the segment data;
  
  classifying segment data of each segment as being of a segment type of a plurality of possible segment types, discarding segment data classified as being of one or more particular predetermined segment types; and
  
  parsing each segment data according to one or more predetermined lexicons and determining whether the segment contains one or more operators, an operator being a natural language token representing a mathematical operation;
  
  upon determining that the segment data representing the segment contains operator data representing one or more operators;
  
  identifying all operators in the segment data representing the segment;
  
  identifying dependency data representing one or more dependencies of the segment data associated with each identified operator;
  
  discarding any tokens not identified as either an operator or a dependency; and
  
  applying one or more operator-specific rules to each identified operator of the segment data to determine a first predicate structure equivalent to the original natural language text of the segment;
  
  upon determining that the segment data representing the segment does not contain operator data representing one or more operators;
  
  identifying each single or multiword token in the segment data that is a predetermined token of the domain;
  
  determining any remaining tokens of the segment that are not predetermined tokens of the domain and map the identified tokens and the remaining tokens to one or more predetermined rules, resulting in a first predicate structure for the segment data of the segment being analyzed;
  
  mapping one or more of the first predicate structures to one or more predetermined machine-executable functions; and
  
  implementing at least one of the mapped machine-executable functions in an electronic document preparation system.
  - 16. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - determining one or more single or multiword tokens in a segment that are a subset of a longer multiword token; and
      
      eliminating from consideration, for this segment, the determined single or multiword tokens.
  - 17. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - removing, from a segment being processed, one or more tokens of the segment from further consideration due to the one or more tokens appearing in exclusion data representing an exclusion list.
  - 18. The system of claim 17, wherein the exclusion list includes words known to have little or no importance in the domain of the data array.
  - 19. The system of claim 15, wherein an operator includes a natural language token specified as or is a synonym of one of a set of operators comprising:
    - add;
      
      subtract;
      
      multiply;
      
      divide;
      
      less than;
      
      greater than;
      
      and;
      
      or;
      
      equal to; and
      
      not equal to.
  - 20. The system of claim 15, wherein the segment types include at least one segment type selected from the group of segment types consisting of:
    - description;
      
      amount;
      
      instructions;
      
      condition;
      
      date;
      
      person status; and
      
      true calc.
  - 21. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - filtering the sentence data to keep only tokens meeting at least one of a plurality of token tests, resulting in filtered token data.
  - 22. The system of claim 21, wherein filtering the sentence data to keep only tokens meeting at least one of a plurality of token tests further comprises:
    - determining a part of speech of the token; and
      
      filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data.
  - 23. The system of claim 22, wherein filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data further comprises:
    - keeping a token of the sentence data if a part of speech associated with the token is a noun.
  - 24. The system of claim 22, wherein filtering the sentence data to keep only tokens having particular predetermined parts of speech characteristics, resulting in filtered token data further comprises:
    - keeping a token of the sentence data if a part of speech associated with the token is a verb.
  - 25. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - predetermining a lexicon by;
      
      determining a frequency of appearance of one or more tokens with respect to all other tokens in a text corpus of the domain;
      
      discarding from the text corpus any tokens having a frequency of appearance below a predetermined threshold frequency, resulting in remaining tokens;
      
      examining each remaining token of the sentence data in a context of a sentence of the sentence data in which the token appears; and
      
      identifying a part of speech associated with the use of the remaining token in the context of its use within segments in which it appears.
  - 26. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - replacing, within one or more sentences of the data array, tokens having similar meanings with a single word synonym of the original token.
  - 27. The system of claim 15, wherein execution of the instructions causes the system to perform operations further comprising:
    - generating, for a first data field of at least a first sentence of the data array, dependency data indicating one or more dependencies, wherein the dependencies include one or more of;
      
      a data value of a second data field from a second sentence of the data array;
      
      a data value of a first data field associated with a sentence of a form other than a form associated with the first data field; and
      
      a constant.
  - 28. The system of claim 15, wherein the sentence data of the data array is associated with a new or updated tax form.
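
Two of the dependent claims lend themselves to a short illustration: the frequency-based lexicon construction (claims 11 and 25) and the elimination of tokens that are a subset of a longer multiword token (claims 2 and 16). The Python sketch below is one possible reading, not the patented implementation. The function names `build_lexicon` and `drop_subsumed`, the use of a token's share of all corpus tokens as its "frequency of appearance", and the word-level containment test are assumptions; the part-of-speech identification step of claims 11 and 25 is omitted because it would require a tagger.

```python
import re
from collections import Counter


def build_lexicon(corpus_sentences, min_relative_frequency):
    """Keep tokens whose share of all corpus tokens meets the threshold."""
    tokens = [t for s in corpus_sentences for t in re.findall(r"[a-z]+", s.lower())]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t for t, c in counts.items() if c / total >= min_relative_frequency}


def drop_subsumed(found_tokens):
    """Remove tokens that appear, word for word, inside a longer token of the same segment."""
    def inside(shorter, longer):
        s, l = shorter.split(), longer.split()
        return len(s) < len(l) and any(
            l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1)
        )
    return [t for t in found_tokens if not any(inside(t, other) for other in found_tokens)]
```

For example, `drop_subsumed(["adjusted gross income", "gross income", "income", "wages"])` keeps only `"adjusted gross income"` and `"wages"`, so the longest domain term wins within a segment. In `build_lexicon`, the threshold is a tunable assumption; a real corpus would likely need a much smaller value than a toy example.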

Current Assignee: Intuit, Inc.
Original Assignee: Intuit, Inc.
Inventors: Mukherjee, Saikat; Manandise, Esme; Agarwal, Sudhir; Patchirajan, Karpaga Ganesh
Primary Examiner(s): Fu, Hao
Application Number: US15/713,161
Publication Number: US 20180032497A1
Time in Patent Office: 893 Days

CPC Class Codes

G06F 40/174   Form filling; Merging

G06F 40/205   Parsing

G06F 40/247   Thesauruses; Synonyms

G06F 40/253   Grammatical analysis; Style...

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/30   Semantic analysis

G06Q 10/10   Office automation; Time man...

G06Q 40/123   Tax preparation or submission

G06V 30/416   Extracting the logical stru...
