Context-aware linear time tokenizer
Abstract
Context automata, such as a left context automaton and a right context automaton, generate a context record that is combined with pattern knowledge stored in a token automaton to segment an input data stream into tokens. The resulting context-aware tokenizer can be used in many natural language processing applications, including text-to-speech synthesizers and text processors. The tokenizer is robust in that, upon failure to match any explicitly stored token pattern, a default token is recognized. Token matching follows a left-to-right longest-match strategy. The overall process operates in linear time, allowing for fast context-dependent tokenization in practice.
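The left-to-right longest-match strategy with a default-token fallback can be sketched as follows. This is a minimal illustration and not the patented implementation: the `Node` class, `build_trie`, `tokenize`, and the `DEFAULT` label are hypothetical names, and a plain character trie stands in for the token automaton. For a fixed, finite pattern set the lookahead at each position is bounded by the longest pattern, so the running time in this sketch grows linearly with the input length.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One state of a character trie (a stand-in for the token automaton)."""

    def __init__(self) -> None:
        self.next: Dict[str, "Node"] = {}
        self.label: Optional[str] = None  # set when a pattern ends in this state


def build_trie(patterns: Dict[str, str]) -> Node:
    """Build a trie from a {pattern text: token label} mapping."""
    root = Node()
    for text, label in patterns.items():
        node = root
        for ch in text:
            node = node.next.setdefault(ch, Node())
        node.label = label
    return root


def tokenize(text: str, root: Node, default_label: str = "DEFAULT") -> List[Tuple[str, str]]:
    """Segment text left to right, always keeping the longest stored match."""
    tokens: List[Tuple[str, str]] = []
    i = 0
    while i < len(text):
        # Fallback: if no stored pattern matches, emit one character as a default token.
        best_end, best_label = i + 1, default_label
        node, j = root, i
        while j < len(text) and text[j] in node.next:
            node = node.next[text[j]]
            j += 1
            if node.label is not None:
                best_end, best_label = j, node.label  # remember the longest match so far
        tokens.append((text[i:best_end], best_label))
        i = best_end
    return tokens
```

With the hypothetical patterns `{"a": "A", "ab": "AB", "abc": "ABC"}`, the input `"abcab"` is split into `abc` followed by `ab`, and any character with no stored pattern comes out as a single-character default token.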
77 Citations
20 Claims
1. A context-aware tokenizer comprising:
at least one context automaton module that generates a context record associated with tokens of an input data stream;
a tokenizing automaton module having a token automaton that partitions said input data stream into predefined tokens based on pattern information contained in said token automaton while simultaneously verifying contextual appropriateness based on said context record.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 10, 11

12. A method of tokenizing an input stream comprising:
using at least one context automaton to generate a context record associated with tokens of said input stream;
using at least one tokenizing automaton to segment said input stream into predefined tokens based on pattern information contained in said context record.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
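One possible reading of the two steps in the method claim, generating a context record and then segmenting against it, is sketched below. Everything here is an assumption made for illustration: the three character classes, the one-character left and right context passes, the `RULES` table, and the function names do not come from the patent text. Each context pass is a deliberately tiny automaton that only remembers the class of the neighboring character.

```python
from typing import List, Set, Tuple


def char_class(ch: str) -> str:
    """Map a character to a coarse class: D(igit), L(etter), or O(ther)."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "L"
    return "O"


def left_context(text: str) -> List[str]:
    """Left-to-right pass: record[i] is the class of the character before position i."""
    record, state = [], "O"  # "O" also marks the start of the stream
    for ch in text:
        record.append(state)
        state = char_class(ch)
    record.append(state)
    return record


def right_context(text: str) -> List[str]:
    """Right-to-left pass: record[i] is the class of the character at position i."""
    record = ["O"] * (len(text) + 1)  # "O" also marks the end of the stream
    for i in range(len(text) - 1, -1, -1):
        record[i] = char_class(text[i])
    return record


# Hypothetical rules: (pattern, label, allowed left classes, allowed right classes).
Rule = Tuple[str, str, Set[str], Set[str]]
RULES: List[Rule] = [
    ("-", "MINUS", {"O"}, {"D"}),
    ("-", "HYPHEN", {"L"}, {"L"}),
]


def tokenize_with_context(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Accept a pattern only when the precomputed context record allows it."""
    left, right = left_context(text), right_context(text)
    ordered = sorted(rules, key=lambda r: len(r[0]), reverse=True)  # longest first
    tokens, i = [], 0
    while i < len(text):
        end, label = i + 1, "DEFAULT"  # default token when nothing matches
        for pattern, name, ok_left, ok_right in ordered:
            j = i + len(pattern)
            if text.startswith(pattern, i) and left[i] in ok_left and right[j] in ok_right:
                end, label = j, name
                break
        tokens.append((text[i:end], label))
        i = end
    return tokens
```

Under these assumed rules, the `-` in `a-b` is labeled `HYPHEN` while the `-` in ` -5` is labeled `MINUS`, because the same surface pattern is checked against different context records. Both context passes and the segmentation loop visit each position a bounded number of times for a fixed rule table.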
Specification