Method and apparatus for robust efficient parsing
First Claim
1. A method of parsing text to form a logical representation of the text, the logical representation having tokens representing non-terminals and words of the text, the method comprising:
- selecting a token;
identifying an integer that represents the selected token; and
utilizing the integer to identify at least one token of the logical representation that begins with the selected token.
2 Assignments
0 Petitions
Accused Products
Abstract
The present invention provides a method for improving the efficiency of parsing text. Aspects of the invention include representing parse tokens as integers where a portion of the integer indicates the location in which a definition for the token can be found. In a further aspect, an integer representing a token points to an array of tokens that can be activated by the token. In another aspect, a list of pointers to partial parses is created before attempting to parse a next word in the text string. The list of pointers includes pointers to partial parses that are expecting particular semantic tokens. A fourth aspect of the invention utilizes a data structure to list the semantic tokens that have been fully parsed for each span in the input text segment. When a token is fully parsed, the list is accessed to determine if the new token should be discarded.
81 Citations
28 Claims
-
1. A method of parsing text to form a logical representation of the text, the logical representation having tokens representing non-terminals and words of the text, the method comprising:
-
selecting a token;
identifying an integer that represents the selected token; and
utilizing the integer to identify at least one token of the logical representation that begins with the selected token. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A computer-readable medium having a data structure used in parsing text, the data structure comprising:
-
a table of integers, each integer representing a token; and
a table of arrays, wherein each integer in the table of integers acts as a pointer to an array in the table of arrays and wherein each cell in the array contains a token identifier for a token activated by the token represented by the integer that points to the array. - View Dependent Claims (7, 8, 9, 10)
-
-
11. A method of parsing text to form a tokenized representation of the text, the method comprising:
-
selecting a word from the text;
forming a partial parse of a token based on the selected word;
identifying an item that is needed to extend the partial parse; and
placing a pointer to the partial parse in a table associated with a next word in the text, the pointer being mapped from the item that is needed to extend the parse. - View Dependent Claims (12, 13, 14)
-
-
15. A computer-readable medium having a data structure used in parsing, the data structure comprising:
-
a set of mappings, each mapping associated with an item needed to extend a partial parse of a token; and
at least one pointer for each mapping, each pointer identifying at least one partial parse structure that needs the item associated with the mapping. - View Dependent Claims (16, 17, 18, 19)
-
-
20. A method of parsing text to form a representation of the text, the representation having structures that span sub-strings of words in the text, each structure having a token at its root, the method comprising:
-
identifying a first structure that spans a first sub-string of words in the text and has a first token as its root, the first sub-string having a starting position and an ending position;
indexing the first structure by the first token and the staring position and ending position of the first sub-string;
identifying a second structure that spans the first sub-string of words and has the first token as its root;
using the first token and the start position and end position of the first sub-string to locate the first structure; and
removing one of the first structure and second structure from further consideration in the formation of the representation of the text. - View Dependent Claims (21, 22, 23, 24)
-
-
25. A computer-readable medium having a data structure, the data structure comprising:
-
a first address field containing a token, the first address field for indexing an entry field that designates parse structures, the parse structures in an entry field having the token of the first address field as their root node; and
a second address field for further indexing the entry field, the second address field containing a representation of words spanned by a parse structure. - View Dependent Claims (26)
-
-
27. A method of parsing text to identify a parse structure containing tokens that represent non-terminals and words, the method comprising:
-
converting a selected token into a token ID;
using a first portion of the token ID to identify a table containing definitions for tokens of a same type as the selected token;
using a second portion of the token ID to locate the definition for the selected token in the identified table; and
using the definition for the selected token as part of the method of identifying the parse structure.
-
-
28. A computer-readable medium having a data structure used in parsing text into a logical form containing tokens, the data structure comprising:
-
a collection of tables wherein each table is associated with a type of token and each table comprises a collection of definitions for tokens of the type; and
a set of token IDs, each token ID representing a separate token and comprising;
a first part that indicates a table that contains the definition for the token represented by the token ID; and
a second part that indicates the location of the definition of the token represented by the token ID within the table identified by the first part.
-
Specification