Methods and apparatus for storing and processing natural language text data as a sequence of fixed length integers
Abstract
A mechanism for more rapidly processing natural language text data and more compactly storing such data in a memory array of 16-bit integers, each integer identifying an individual term in the text data stored in a term lookup table. The original text is parsed into a sequence of substrings consisting of alternating alphanumeric terms and intervening punctuation strings. Each substring (with the exception of a single space between adjacent alphanumeric terms) is translated into an identifying integer placed in the memory array. To convert each term into its identifying integer, the term lookup table is searched for a previously stored term which matches the given term and, if a matching term is found, the given term is converted into the integer which identifies the matching term. If a previously stored matching term is not found, the given term is stored in an available empty location in the term lookup table and is converted into the integer which addresses that available empty location. High-speed term-to-integer conversion is performed using a vectored binary tree as the term lookup table. High-speed searches are performed by scanning the memory array for integers which identify target words, and additional lookup tables which are also addressable by a given term's identifying number may be employed to determine attributes of that term. A text file manipulation program employs the integer array text data to rapidly search, display, categorize, annotate, and highlight the text of a natural language text database. Highlighted passages are specified by their starting and ending positions in the integer array and are characterized by stored data which specifies the highlight color, annotation text, and one or more category codes associated with the highlighted passage. A keyword-in-context listing may be displayed which presents a sorted list of all phrases beginning with any term in a user-specified term list.
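The encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the ASCII character classes, the function names (`encode`, `decode`, `find_term`), and the use of a Python list plus dictionary in place of the vectored binary tree are all assumptions made for illustration.

```python
import re

# Assumption: a "term" is a run of ASCII letters or digits; any other run of
# characters is treated as a punctuation string.
TERM_RE = re.compile(r"[A-Za-z0-9]+")
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")


def encode(text, table, index):
    """Convert text into a list of integer codes.

    `table` is a list mapping code -> string; `index` is a dict mapping
    string -> code (a stand-in for the vectored binary tree).
    """
    tokens = TOKEN_RE.findall(text)
    codes = []
    for i, tok in enumerate(tokens):
        # Tokens alternate between terms and punctuation, so an interior
        # token equal to " " is a single space between two terms: omit it.
        if tok == " " and 0 < i < len(tokens) - 1:
            continue
        code = index.get(tok)
        if code is None:
            code = len(table)  # next available empty location
            if code > 0xFFFF:
                raise OverflowError("more than 65536 distinct strings")
            table.append(tok)
            index[tok] = code
        codes.append(code)
    return codes


def decode(codes, table):
    """Rebuild the original text, restoring each omitted single space."""
    parts = []
    prev_was_term = False
    for code in codes:
        tok = table[code]
        is_term = TERM_RE.fullmatch(tok) is not None
        if is_term and prev_was_term:
            parts.append(" ")  # two adjacent terms imply one space
        parts.append(tok)
        prev_was_term = is_term
    return "".join(parts)


def find_term(codes, index, term):
    """Return the positions of `term` by scanning the integer array."""
    target = index.get(term)
    if target is None:
        return []
    return [i for i, code in enumerate(codes) if code == target]
```

In this sketch, a search compares integers rather than character strings, which is the property the abstract relies on for fast scanning. A term that was never stored has no code, so the search can return immediately without touching the array.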
20 Claims
1. The method of representing natural language text data in encoded form which comprises the steps of subdividing said text data into a sequence of character strings, forming at least one string table containing an addressable copy of each unique one of said character strings, and forming a sequence of integer values each of which specifies a corresponding one of said character strings.
9. Apparatus for storing and processing natural language text data consisting of a sequence of encoded characters, said apparatus comprising, in combination,
a parser for subdividing said text data into a sequence of natural language terms and punctuation strings wherein each of said terms consists of characters in a first predetermined set of characters which includes the letters of the natural language alphabet, and wherein each of said punctuation strings consists of characters in a second predetermined set of characters which excludes said letters of the alphabet,
a string lookup storage unit for processing said sequence of term and punctuation strings from said parser and for encoding each given one of said term and punctuation strings as an integer value which uniquely specifies the content of said given one of said term and punctuation strings,
an integer storage unit for storing the integer values from said string lookup storage unit as a sequence of integer values which represent said natural language text, and
means for reproducing said natural language text data in its original form as a sequence of encoded characters by concatenating the terms and punctuation strings whose content is specified by each successive one of said sequence of integer values.
15. The method of processing natural language text which comprises a sequence of terms each of which consists of one or more encoded characters, said method comprising the steps of:
storing each given unique term in said sequence of terms in a first lookup table at a location which is addressable by a unique integer which corresponds to and identifies said given unique term,
employing said first lookup table to convert said sequence of terms into a sequence of corresponding integers,
storing said sequence of corresponding integers in a memory unit,
retrieving said sequence of corresponding integers from said memory unit, and
employing said first lookup table to convert each integer in said sequence of corresponding integers into the term it identifies to reproduce said natural language text.
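The store-and-retrieve steps of claim 15 can be sketched as follows. This is a hedged illustration only: the `TermTable` class, the `store` and `retrieve` functions, and the choice of a byte string as the "memory unit" are hypothetical, and the 16-bit code width is borrowed from the abstract rather than from the claim itself.

```python
from array import array


class TermTable:
    """Hypothetical first lookup table: each unique term gets one integer."""

    def __init__(self):
        self.terms = []  # integer -> term
        self.ids = {}    # term -> integer

    def intern(self, term):
        """Return the integer for `term`, storing the term if it is new."""
        term_id = self.ids.get(term)
        if term_id is None:
            term_id = len(self.terms)
            if term_id > 0xFFFF:
                raise OverflowError("table exceeds the 16-bit code range")
            self.terms.append(term)
            self.ids[term] = term_id
        return term_id


def store(terms, table):
    """Convert terms to integers and pack them as unsigned shorts."""
    return array("H", (table.intern(t) for t in terms)).tobytes()


def retrieve(blob, table):
    """Unpack the integers and convert each back to the term it identifies."""
    codes = array("H")
    codes.frombytes(blob)
    return [table.terms[code] for code in codes]


if __name__ == "__main__":
    table = TermTable()
    words = ["to", "be", "or", "not", "to", "be"]
    blob = store(words, table)
    print(len(table.terms), "unique terms;", len(blob), "bytes stored")
    print(retrieve(blob, table) == words)
```

Repeated terms cost only one fixed-width integer each after their first occurrence, which is where the compaction comes from. The table itself must be kept alongside the integer sequence, since the integers are meaningless without it.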
Specification