Method and apparatus for tokenizing text
Abstract
An efficient method and apparatus for tokenizing natural language text minimizes required data storage and produces guaranteed incremental output. Id (text) is composed with a tokenizer to create a finite state machine representing tokenization paths. The tokenizer itself is in the form of a finite state transducer. The process is carried out in a breadth-first manner so that all possibilities are explored at each character position before progressing. Output is produced incrementally and occurs only when all paths collapse into one. Output may be delayed until a token boundary is reached. In this manner, the output is guaranteed and will not be retracted unless the text is globally ill-formed. Each time output is produced, storage space is freed for subsequent text processing.
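The option of delaying output until a token boundary is reached can be illustrated with a small buffering sketch. It is not taken from the specification: it assumes that incremental output arrives as string fragments and that a newline marks a token boundary. The function name `release_at_boundaries` and the `boundary` parameter are hypothetical.

```python
from typing import Iterable, Iterator


def release_at_boundaries(chunks: Iterable[str], boundary: str = "\n") -> Iterator[str]:
    """Buffer incremental output and release only complete tokens.

    Assumption: `boundary` is the marker the tokenizer writes between tokens.
    """
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Release every token that is now closed by a boundary marker.
        while boundary in pending:
            token, pending = pending.split(boundary, 1)
            if token:
                yield token
    # End of text: whatever is left is the final token.
    if pending:
        yield pending


if __name__ == "__main__":
    for token in release_at_boundaries(["e", ".g", "\n.\n", "o", "k"]):
        print(token)
```

Fragments that end in the middle of a token stay in the buffer, so a consumer only ever sees whole tokens.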
24 Claims
1. A method for tokenizing text with a tokenizing transducer comprising the steps of:
(a) storing all current configurations including at least one current configuration in a configuration storage unit, the at least one current configuration including an extension state and an output node;
(b) selecting one said at least one current configuration that has not been processed;
(c) processing said selected current configuration wherein processing includes creating any next configurations;
(d) repeating steps b and c until all of said current configurations have been processed;
(e) freeing all of said current configurations;
(f) redefining all of the any next configurations as current configurations and the next text position as the current text position;
(g) counting said current configurations; and
(h) providing output when exactly one next configuration exists.
Dependent claims: 2-18.
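Steps (a) through (h) can be sketched as a breadth-first loop over the text. This is a minimal illustration under stated assumptions, not the patented implementation. The toy transducer `ARCS`, its state names, the `FINAL` set, and the helpers `flush` and `tokenize` are all hypothetical. The toy rules treat a period as part of a token when a letter follows and as a separate token otherwise, and a newline in the output marks a token boundary.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class OutputNode:
    text: str
    parent: Optional["OutputNode"] = None  # back-pointer to earlier output


@dataclass
class Configuration:
    state: str                  # extension state in the transducer
    out: Optional[OutputNode]   # output node reached so far


# Hypothetical toy transducer: state -> [(lo, hi, output template, next state)].
# "{c}" in a template copies the matched input character.
ARCS = {
    "S": [("a", "z", "{c}", "W"), (" ", " ", "", "S")],
    "W": [("a", "z", "{c}", "W"), (" ", " ", "\n", "S"),
          (".", ".", ".", "D1"),      # period stays inside the token
          (".", ".", "\n.", "D2")],   # period becomes its own token
    "D1": [("a", "z", "{c}", "W")],
    "D2": [(" ", " ", "\n", "S")],
}
FINAL = {"S", "W", "D2"}


def flush(node: Optional[OutputNode]) -> str:
    """Walk the back-pointers and return the pending output in order."""
    parts: List[str] = []
    while node is not None:
        parts.append(node.text)
        node = node.parent
    return "".join(reversed(parts))


def tokenize(text: str) -> Iterator[str]:
    current = [Configuration("S", None)]                   # (a)
    for ch in text:
        nxt: List[Configuration] = []
        for cfg in current:                                # (b), (d)
            for lo, hi, template, target in ARCS.get(cfg.state, []):
                if lo <= ch <= hi:                         # (c)
                    node = OutputNode(template.format(c=ch), cfg.out)
                    nxt.append(Configuration(target, node))
        current = nxt                                      # (e), (f)
        if not current:
            raise ValueError("text is ill-formed for this transducer")
        if len(current) == 1:                              # (g), (h)
            yield flush(current[0].out)
            current[0].out = None  # emitted output no longer needs storage
    # End of text: only configurations in a final state survive.
    finals = [c for c in current if c.state in FINAL]
    if len(finals) != 1:
        raise ValueError("no unique tokenization at end of text")
    tail = flush(finals[0].out)
    if tail:
        yield tail


if __name__ == "__main__":
    print("".join(tokenize("e.g. ok")).split("\n"))
```

While two configurations are alive (after each period in this toy), nothing is emitted. Output resumes once the following character eliminates one of them. The sketch does not merge configurations that reach the same state, which a fuller implementation would likely need.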
19. A system for tokenizing text, comprising:
an I/O device for inputting recognized text and outputting tokenized text;
control means for selecting recognized text and for selecting tokenized text to be output;
a tokenizing means for defining configurations each of said configurations comprising,
an output node;
an extension state;
a designation of one of current and next; and
a designation of one of processed and unprocessed;
determining means for determining all possible transitions at the extension state of each of said configurations;
comparing means for comparing a range character with a current text character wherein said determining means defines a path based on the comparison;
a configuration storage unit for storing the configurations; and
an output storage unit for storing the tokenized text.
Dependent claims: 20-24.
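The configuration record and storage unit recited above might be modeled as follows. This is a sketch under assumptions rather than the claimed apparatus. The names `Phase`, `Configuration`, `ConfigurationStore`, `select_unprocessed`, `advance`, and `in_range` are hypothetical, and the comparison assumes that a transition is labeled with an inclusive character range.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    CURRENT = "current"
    NEXT = "next"


@dataclass
class Configuration:
    output_node: Optional[object]   # handle into the output storage
    extension_state: str            # transducer state to extend from
    phase: Phase = Phase.CURRENT    # designation: current or next
    processed: bool = False         # designation: processed or unprocessed


@dataclass
class ConfigurationStore:
    items: List[Configuration] = field(default_factory=list)

    def add(self, cfg: Configuration) -> None:
        self.items.append(cfg)

    def select_unprocessed(self) -> Optional[Configuration]:
        """Return one current configuration that is still unprocessed."""
        for cfg in self.items:
            if cfg.phase is Phase.CURRENT and not cfg.processed:
                return cfg
        return None

    def advance(self) -> int:
        """Drop current configurations, relabel next ones as current,
        and return how many configurations remain."""
        self.items = [c for c in self.items if c.phase is Phase.NEXT]
        for cfg in self.items:
            cfg.phase = Phase.CURRENT
            cfg.processed = False
        return len(self.items)


def in_range(lo: str, hi: str, ch: str) -> bool:
    """Compare a transition's character range with the current text character."""
    return lo <= ch <= hi


if __name__ == "__main__":
    store = ConfigurationStore()
    store.add(Configuration(None, "start"))
    cfg = store.select_unprocessed()
    cfg.processed = True
    if in_range("a", "z", "k"):
        store.add(Configuration(None, "word", phase=Phase.NEXT))
    print(store.advance())
```

Keeping both designations on the record lets a single store hold the current and next generations at once. Separate lists would work equally well.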
Specification