Method and apparatus for improved tokenization of natural language text

  • US 5,890,103 A
  • Filed: 07/19/1996
  • Issued: 03/30/1999
  • Est. Priority Date: 07/19/1995
  • Status: Expired due to Fees
First Claim

1. A computerized tokenizer for identifying a token formed of a string of lexical characters found in a stream of digitized natural language text, the computerized tokenizer comprising:

  • parsing means for extracting lexical and non-lexical characters from the stream of digitized text,
  • identifying means coupled with said parsing means for identifying a set of tokens, each token being formed of a string of parsed lexical characters bounded by non-lexical characters, and
  • filtering means coupled with said identifying means for selecting a candidate token from said set of tokens, said candidate token being suitable for linguistic processing beyond the identification of tokens.
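
The three elements of claim 1 can be read as a parse, identify, filter pipeline. The following is a minimal illustrative sketch of that reading, not the patented implementation. All function names are hypothetical, and two points are assumptions the claim text does not specify: a character counts as lexical when it is alphanumeric, and a token counts as a candidate when it is purely alphabetic and longer than one character. The start and end of the stream are also treated as token boundaries.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def parse(text: str) -> Iterator[Tuple[str, bool]]:
    """Parsing step: tag each character as lexical or non-lexical.

    Assumption: "lexical" means alphanumeric; the claim does not define it.
    """
    for ch in text:
        yield ch, ch.isalnum()


def identify_tokens(parsed: Iterable[Tuple[str, bool]]) -> List[str]:
    """Identifying step: collect runs of lexical characters into tokens.

    A non-lexical character closes the current run. The end of the stream
    is treated as a boundary too (an assumption of this sketch).
    """
    tokens: List[str] = []
    current: List[str] = []
    for ch, is_lexical in parsed:
        if is_lexical:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def default_is_candidate(token: str) -> bool:
    """Hypothetical filter rule: alphabetic tokens longer than one character."""
    return token.isalpha() and len(token) > 1


def filter_candidates(
    tokens: Iterable[str],
    is_candidate: Callable[[str], bool] = default_is_candidate,
) -> List[str]:
    """Filtering step: keep only tokens judged suitable for later processing."""
    return [t for t in tokens if is_candidate(t)]


def tokenize(text: str) -> List[str]:
    """Chain the three steps in the order the claim lists them."""
    return filter_candidates(identify_tokens(parse(text)))


if __name__ == "__main__":
    print(tokenize("The 3 cats sat, a mat."))
```

The filter rule is passed in as a parameter because the claim only requires that candidates be "suitable for linguistic processing" and leaves the suitability test open.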
