Method and apparatus for tokenizing text

US 5,721,939 A
Filed: 08/03/1995
Issued: 02/24/1998
Est. Priority Date: 08/03/1995
Status: Expired due to Term

Abstract

An efficient method and apparatus for tokenizing natural language text minimizes required data storage and produces guaranteed incremental output. Id (text) is composed with a tokenizer to create a finite state machine representing tokenization paths. The tokenizer itself is in the form of a finite state transducer. The process is carried out in a breadth-first manner so that all possibilities are explored at each character position before progressing. Output is produced incrementally and occurs only when all paths collapse into one. Output may be delayed until a token boundary is reached. In this manner, the output is guaranteed and will not be retracted unless the text is globally ill-formed. Each time output is produced, storage space is freed for subsequent text processing.
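
The breadth-first procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy transducer, its state names, the `LETTER` placeholder, and the `tokenize` function are all assumptions made for illustration, and pending output is kept as a plain string per configuration rather than as a shared graph of output nodes. Each transition pairs a domain symbol (the tokenized side, with `|` as the token boundary) with a range symbol (the surface text).

```python
EPS = ""             # epsilon on the range side: consumes no text character
LETTER = "<letter>"  # placeholder: any alphabetic character, mapped to itself

# Hypothetical toy transducer: state -> [(domain, range, destination)].
TRANSITIONS = {
    "start":     [(LETTER, LETTER, "word")],
    "word":      [(LETTER, LETTER, "word"),
                  (".", ".", "inner_dot"),   # period inside a token, as in "a.b"
                  ("|", EPS, "pre_punct"),   # boundary with no surface text
                  ("|", " ", "gap")],        # boundary realized as a space
    "inner_dot": [(LETTER, LETTER, "word")],
    "pre_punct": [(".", ".", "punct")],
    "punct":     [("|", " ", "gap")],
    "gap":       [("", " ", "gap"),          # extra white space, no output
                  (LETTER, LETTER, "word")],
}


def tokenize(text, start="start"):
    """Return (emitted chunks, configurations still alive at end of text)."""
    current = [(start, "")]  # each configuration: (state, pending output)
    chunks = []
    for ch in text:
        nxt = []
        queue = list(current)
        # Explore every configuration at this text position before moving on.
        while queue:
            state, out = queue.pop()
            for dom, rng, dest in TRANSITIONS[state]:
                if rng == EPS:
                    # Epsilon range: stays at the current text position.
                    queue.append((dest, out + dom))
                elif rng == ch or (rng == LETTER and ch.isalpha()):
                    # Range matches the character: becomes a next configuration.
                    nxt.append((dest, out + (ch if dom == LETTER else dom)))
        # Emit only when all paths have collapsed into exactly one.
        if len(nxt) == 1:
            state, out = nxt[0]
            if out:
                chunks.append(out)
            nxt = [(state, "")]
        current = nxt
    return chunks, current
```

With this toy transducer, `tokenize("a. b")[0]` yields the chunks `"a"`, `"|.|"`, and `"b"`: after the period two paths are alive (period inside a token, or boundary plus punctuation), so nothing is emitted until the space eliminates one of them.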

24 Claims

1. A method for tokenizing text with a tokenizing transducer comprising the steps of:
- (a) storing all current configurations including at least one current configuration in a configuration storage unit, the at least one current configuration including an extension state and an output node;
  
  (b) selecting one said at least one current configuration that has not been processed;
  
  (c) processing said selected current configuration wherein processing includes creating any next configurations;
  
  (d) repeating steps b and c until all of said current configurations have been processed;
  
  (e) freeing all of said current configurations;
  
  (f) redefining all of the any next configurations as current configurations and the next text position as the current text position;
  
  (g) counting said current configurations; and
  
  (h) providing output when exactly one next configuration exists.
- - 2. The method of claim 1, further comprising the step of providing output in the form of a finite state machine.
  - 3. The method of claim 1, wherein the processing step comprises:
    - selecting any unprocessed transitions from the extension state of the selected configuration;
      
      determining a range symbol; and
      
      setting an extension state configuration based on the range symbol to one of a current configuration and a next configuration.
  - 4. The method of claim 3, wherein the extension state configuration has an extension state that is the destination of the selected transition.
  - 5. The method of claim 4, wherein the extension state configuration is set to the next configuration when the range symbol matches the current text character and is set to the current configuration when the range symbol is equal to epsilon.
  - 6. The method of claim 5, wherein said processing step further comprises adding an output edge to an output node of the extension state configuration and labelling said output edge with a domain symbol of the selected transition, said output edge pointing to the output node of the selected configuration.
  - 7. The method of claim 6, wherein said processing step is repeated until all of said any unprocessed transitions have been processed.
  - 8. The method of claim 1, wherein said step of providing output occurs when the one next configuration is marked for output.
  - 9. The method of claim 8, wherein said one next configuration is marked for output only if all output edges at its output node are labelled with a token boundary.
  - 10. The method of claim 1, further comprising:
    - defining all possible paths including at least one path in steps a-f;
      
      ending a path prior to reaching an end of text; and
      
      deleting said path in order to free storage space.
  - 11. The method of claim 1, further comprising providing output in the form of a finite state machine having edges pointing to output nodes.
  - 12. The method of claim 1, wherein said step of providing output comprises outputting a best path.
  - 13. The method of claim 1, wherein after said step of providing output, the output is deleted from storage.
  - 14. The method of claim 1, wherein the tokenizing transducer is defined by the following rules:
    - a. . → epsilon / _ |.
      
      b. | → epsilon / _ punctuation
      
      c. space → white-space+; and
      
      d. | → white-space+.
  - 15. The method of claim 1, wherein said tokenizing transducer is created by the steps of:
    - defining punctuation conventions; and
      
      defining higher level lexical information.
  - 16. The method of claim 15, wherein the higher level lexical information includes a list of compound words and a list of abbreviations.
  - 17. The method of claim 15, further comprising the step of composing the higher level lexical information with the tokenizer to form a finite state transducer.
  - 18. The method of claim 15, further comprising:
    - representing said higher level lexical information as a transducer H;
      
      pairing said tokenizing transducer with H to create acceptable states;
      
      selecting any unprocessed transitions from the extension state of the selected configuration;
      
      determining a range symbol;
      
      determining if said range symbol matches a current text character; and
      
      determining whether said transition results in an output edge that has one of the acceptable states if the range symbol matches the current text character.
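
The output structure referred to in claims 6, 8, 9, and 11 can be pictured as a graph whose edges point backward from each output node to the output node of the configuration it came from. The following is a hedged Python sketch of that idea only; the class `OutputNode` and the helpers `add_output_edge`, `marked_for_output`, and `read_back` are hypothetical names, and requiring at least one edge before marking a node for output is an assumption not stated in the claims.

```python
from dataclasses import dataclass, field

TOKEN_BOUNDARY = "|"


@dataclass(eq=False)
class OutputNode:
    # Each edge is (domain symbol, output node of the preceding configuration).
    edges: list = field(default_factory=list)


def add_output_edge(node, domain_symbol, previous):
    """Label a new edge with the domain symbol and point it at `previous`."""
    node.edges.append((domain_symbol, previous))


def marked_for_output(node):
    """True if every output edge at this node is a token boundary.

    Assumption: a node with no edges at all is not marked.
    """
    return bool(node.edges) and all(
        label == TOKEN_BOUNDARY for label, _ in node.edges
    )


def read_back(node):
    """Follow the first edge at each node back to the root.

    Only meaningful when the path is unambiguous (one edge per node).
    """
    symbols = []
    while node.edges:
        label, node = node.edges[0]
        symbols.append(label)
    return "".join(reversed(symbols))
```

Because edges point backward, several live configurations can share the output already built for their common prefix, and a path that dies can be dropped without touching the nodes other paths still reference.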

19. A system for tokenizing text, comprising:
- an I/O device for inputting recognized text and outputting tokenized text;
  
  control means for selecting recognized text and for selecting tokenized text to be output;
  
  a tokenizing means for defining configurations each of said configurations comprising, an output node;
  
  an extension state;
  
  a designation of one of current and next; and
  
  a designation of one of processed and unprocessed;
  
  determining means for determining all possible transitions at the extension state of each of said configurations;
  
  comparing means for comparing a range character with a current text character wherein said determining means defines a path based on the comparison;
  
  a configuration storage unit for storing the configurations; and
  
  an output storage unit for storing the tokenized text.
- - 20. The system of claim 19, further comprising counting means for counting a number of configurations remaining in the configuration storage unit, wherein when the number of remaining configurations is exactly equal to one, the I/O device outputs the tokenized text.
  - 21. The system of claim 19, wherein the tokenizing means comprises a tokenizing transducer defined by the following rules:
    - a. . → epsilon / _ |.
      
      b. | → epsilon / _ punctuation
      
      c. space → white-space+; and
      
      d. | → white-space+.
  - 22. The system according to claim 19, wherein the tokenizing transducer includes punctuation conventions and higher level lexical information.
  - 23. The system according to claim 22, wherein the higher level information includes a list of compound words and a list of abbreviations.
  - 24. The system according to claim 22, wherein the rules are composed with the higher level lexical information to form the tokenizing transducer.
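
The configuration record of claim 19 and the counting means of claim 20 can be sketched as a small data structure. This is an illustrative Python sketch under stated assumptions, not the claimed apparatus: `Phase`, `Configuration`, `ConfigurationStore`, and the method names are hypothetical, and the store is a plain in-memory list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List, Optional


class Phase(Enum):
    CURRENT = "current"
    NEXT = "next"


@dataclass
class Configuration:
    extension_state: Any           # state in the tokenizing transducer
    output_node: Any               # node in the output structure
    phase: Phase = Phase.CURRENT   # designation of current or next
    processed: bool = False        # designation of processed or unprocessed


class ConfigurationStore:
    """Hypothetical stand-in for the configuration storage unit."""

    def __init__(self) -> None:
        self._items: List[Configuration] = []

    def add(self, config: Configuration) -> None:
        self._items.append(config)

    def next_unprocessed(self) -> Optional[Configuration]:
        """Return a current configuration that has not been processed."""
        for config in self._items:
            if config.phase is Phase.CURRENT and not config.processed:
                return config
        return None

    def advance(self) -> int:
        """Free current configurations, promote next ones, return the count."""
        survivors = [c for c in self._items if c.phase is Phase.NEXT]
        for config in survivors:
            config.phase = Phase.CURRENT
            config.processed = False
        self._items = survivors
        return len(survivors)
```

A caller would loop on `next_unprocessed()` until it returns `None`, then call `advance()`; a return value of exactly one corresponds to the condition under which claim 20 has the I/O device output the tokenized text.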

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Kaplan, Ronald M.
Primary Examiner(s): McElheny, Jr., Donald E.
Assistant Examiner(s): Thomas, Joseph
Application Number: US08/510,626
Time in Patent Office: 936 Days
Field of Search: 395/759, 395/752, 395/751
US Class Current: 704/9
CPC Class Codes: G06F 40/284 Lexical analysis, e.g. toke...
