Lean parsing: a natural language processing system and method for parsing domain-specific languages
First Claim
1. A computer implemented method, comprising:
receiving electronic textual data relating to a form for which one or more machine-executable form field functions needs to be determined, the electronic textual data including natural language instructions relating to the determination of one or more form field values of the form;
analyzing the electronic textual data to determine sentence data representing a plurality of separate sentences of the electronic textual data;
separating the electronic textual data into a data array formed of the sentence data of the determined plurality of separate sentences;
for each given sentence of sentence data representing sentences in the data array;
isolating segment data of one or more segments of the sentence data while relating each resulting segment to prior and succeeding segments of the sentence data, further storing the isolated segment data in one or more segment data memory locations organized to retain structure and relation of one segment to another;
for each segment of the segment data;
classifying segment data of each segment as being of a segment type of a plurality of possible segment types, discarding segment data classified as being of one or more particular predetermined segment types; and
parsing each segment data according to one or more predetermined lexicons and determining whether the segment contains one or more operators, an operator being a natural language token representing an operation that may be performed on data;
upon determining that the segment data representing the segment contains operator data representing one or more operators;
identifying all operators in the segment data representing the segment;
identifying dependency data representing one or more dependencies of the segment data associated with each identified operator;
discarding any tokens not identified as either an operator or a dependency; and
applying one or more operator-specific rules to each identified operator of the segment data to determine a first predicate structure equivalent to the original natural language text of the segment; and
upon determining that the segment data representing the segment does not contain operator data representing one or more operators;
identifying each single or multiword token in the segment data that is a predetermined token of the domain;
determining any remaining tokens of the segment that are not predetermined tokens of the domain and map the identified tokens and the remaining tokens to one or more predetermined rules, resulting in a first predicate structure for the segment data of the segment being analyzed;
mapping one or more of the first predicate structures to one or more predetermined machine-executable functions; and
implementing at least one of the mapped machine-executable functions in an electronic document preparation system.
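The first steps of the claimed method (sentence array, linked segments, operator branch) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the operator lexicon, the comma/semicolon segmentation rule, the `Segment` record, and the assumption that dependencies are line references such as "line 3" are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical operator lexicon: natural-language token -> canonical operator.
OPERATOR_LEXICON = {"add": "add", "subtract": "subtract", "multiply": "multiply"}


@dataclass
class Segment:
    text: str
    index: int
    prev_index: Optional[int]  # link to the prior segment, if any
    next_index: Optional[int]  # link to the succeeding segment, if any


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def isolate_segments(sentence: str) -> list[Segment]:
    """Split on commas/semicolons (an assumption) and record each neighbour."""
    parts = [p.strip() for p in re.split(r"[;,]", sentence.rstrip(".!?")) if p.strip()]
    last = len(parts) - 1
    return [
        Segment(p, i, i - 1 if i > 0 else None, i + 1 if i < last else None)
        for i, p in enumerate(parts)
    ]


def to_predicate(segment: Segment) -> Optional[str]:
    """Operator branch only: keep the operator and its dependencies, drop the rest."""
    lowered = segment.text.lower()
    ops = [OPERATOR_LEXICON[t] for t in lowered.split() if t in OPERATOR_LEXICON]
    if not ops:
        return None  # the non-operator branch would handle this segment
    # Assumption: dependencies are line references such as "line 3".
    deps = re.findall(r"line\s+(\d+\w?)", lowered)
    return f"{ops[0]}({', '.join('line_' + d for d in deps)})"


sentences = split_sentences("Add line 1 and line 2. If zero or less, enter 0.")
predicate = to_predicate(isolate_segments(sentences[0])[0])
```

Storing neighbour indices on each segment is one simple way to "retain structure and relation of one segment to another"; a tree or linked list would serve equally well.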
1 Assignment
0 Petitions
Abstract
A method and system parse natural language in a unique way, determining important words pertaining to a text corpus of a particular genre, such as tax preparation. Sentences extracted from instructions or forms pertaining to tax preparation, for example, are parsed to determine word groups forming various parts of speech, and then are processed to exclude words on an exclusion list and word groups that don't meet predetermined criteria. From the resulting data, synonyms are replaced with a common functional operator and the resulting sentence text is analyzed against predetermined patterns to determine one or more functions to be used in a document preparation system.
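The exclusion-list, synonym-replacement, and pattern-matching steps named in the abstract might look something like the sketch below. The exclusion list, the synonym table, the patterns, and the function names `sum_of_lines` and `difference_of_lines` are hypothetical examples invented for illustration; the source does not list them.

```python
import re

# All three tables are hypothetical examples, not taken from the patent.
EXCLUSION_LIST = {"the", "of", "and", "from", "amount", "amounts", "on"}
SYNONYMS = {
    "add": "add", "combine": "add", "total": "add", "sum": "add",
    "subtract": "subtract", "deduct": "subtract",
}
# Pattern over the normalized token string -> name of a function template.
PATTERNS = [
    (re.compile(r"^add( line_\w+){2,}$"), "sum_of_lines"),
    (re.compile(r"^subtract line_\w+ line_\w+$"), "difference_of_lines"),
]


def normalize(sentence: str) -> str:
    """Drop excluded words and replace operator synonyms with a common operator."""
    # Fuse "line 4" into a single token "line_4" before tokenizing.
    text = re.sub(r"\blines?\s+(\w+)", r"line_\1", sentence.lower())
    tokens = re.findall(r"[a-z0-9_]+", text)
    return " ".join(SYNONYMS.get(t, t) for t in tokens if t not in EXCLUSION_LIST)


def match_function(sentence: str):
    """Return the name of the first pattern matching the normalized sentence."""
    normalized = normalize(sentence)
    for pattern, name in PATTERNS:
        if pattern.match(normalized):
            return name
    return None


result = match_function("Combine the amounts on line 4 and line 5.")
```

Normalizing synonyms before matching keeps the pattern table small: one pattern for `add` covers "combine", "total", and "sum" alike.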
78 Citations
28 Claims
1. A computer implemented method, as recited in full under First Claim above.
Dependent claims: 2-14.
15. A system, comprising:
one or more processors;
at least one memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the system to perform operations comprising;
receiving electronic textual data relating to a form for which one or more machine-executable form field functions needs to be determined, the electronic textual data including natural language instructions relating to the determination of one or more form field values of the form;
analyzing the electronic textual data to determine sentence data representing a plurality of separate sentences of the electronic textual data;
separating the electronic textual data into a data array formed of the sentence data of the determined plurality of separate sentences;
for each given sentence of sentence data representing sentences in the data array;
isolating segment data of one or more segments of the sentence data while relating each resulting segment to prior and succeeding segments of the sentence data, further storing the isolated segment data in one or more segment data memory locations organized to retain structure and relation of one segment to another;
for each segment of the segment data;
classifying segment data of each segment as being of a segment type of a plurality of possible segment types, discarding segment data classified as being of one or more particular predetermined segment types; and
parsing each segment data according to one or more predetermined lexicons and determining whether the segment contains one or more operators, an operator being a natural language token representing a mathematical operation;
upon determining that the segment data representing the segment contains operator data representing one or more operators;
identifying all operators in the segment data representing the segment;
identifying dependency data representing one or more dependencies of the segment data associated with each identified operator;
discarding any tokens not identified as either an operator or a dependency; and
applying one or more operator-specific rules to each identified operator of the segment data to determine a first predicate structure equivalent to the original natural language text of the segment;
upon determining that the segment data representing the segment does not contain operator data representing one or more operators;
identifying each single or multiword token in the segment data that is a predetermined token of the domain;
determining any remaining tokens of the segment that are not predetermined tokens of the domain and map the identified tokens and the remaining tokens to one or more predetermined rules, resulting in a first predicate structure for the segment data of the segment being analyzed;
mapping one or more of the first predicate structures to one or more predetermined machine-executable functions; and
implementing at least one of the mapped machine-executable functions in an electronic document preparation system.
Dependent claims: 16-28.
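For segments without an operator, the claims describe identifying single or multiword domain tokens, mapping them to rules, and then mapping the resulting predicate structure to an executable function. A rough sketch of that branch follows, assuming a greedy longest-match scan over a small lexicon. The lexicon entries, the single "enter a field" rule, and the `copy_field` function are hypothetical and stand in for whatever predetermined tokens, rules, and functions a real system would define.

```python
from typing import Callable, Optional

# Hypothetical domain lexicon: each entry is a tuple of lowercase words.
DOMAIN_TOKENS = {("adjusted", "gross", "income"), ("form", "1040"), ("enter",)}
MAX_LEN = max(len(t) for t in DOMAIN_TOKENS)


def tag_domain_tokens(words: list[str]) -> tuple[list[str], list[str]]:
    """Greedy longest-match scan; returns (domain tokens, remaining tokens)."""
    domain, remaining, i = [], [], 0
    while i < len(words):
        # Try the longest candidate first so multiword tokens win.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            if tuple(words[i:i + n]) in DOMAIN_TOKENS:
                domain.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            remaining.append(words[i])
            i += 1
    return domain, remaining


def apply_rules(domain: list[str]) -> Optional[tuple[str, str]]:
    """Hypothetical rule: 'enter' followed by a field token means 'copy that field'."""
    if len(domain) >= 2 and domain[0] == "enter":
        return ("copy_field", domain[1])
    return None


# Predicate name -> machine-executable function (illustrative only).
FUNCTIONS: dict[str, Callable[[dict, str], float]] = {
    "copy_field": lambda values, field: values[field],
}


def execute(predicate: tuple[str, str], values: dict) -> float:
    name, *args = predicate
    return FUNCTIONS[name](values, *args)


words = "enter your adjusted gross income from form 1040".split()
domain, remaining = tag_domain_tokens(words)
value = execute(apply_rules(domain), {"adjusted_gross_income": 52000.0})
```

Keeping the predicate as plain data (a tuple) separates parsing from execution, so the same predicate could be mapped to different function implementations in different document preparation systems.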
Specification