Tokenizer for a natural language processing system

  • US 7,092,871 B2
  • Filed: 03/30/2001
  • Issued: 08/15/2006
  • Est. Priority Date: 07/20/2000
  • Status: Expired due to Fees
First Claim

1. A method of segmenting a textual input string including characters separated by spaces, comprising:

  • receiving the textual input string;

    proposing a first segmentation of at least a portion of the input string by segmenting the input string at the spaces to obtain a plurality of tokens;

    attempting to validate word boundaries in the first segmentation by submitting the first segmentation to a linguistic knowledge component;

    if the first segmentation is not validated, proposing a subsequent segmentation by:

    determining whether invalid tokens contain any of a predetermined plurality of multi-character punctuation strings or emoticons;

    if so, segmenting the tokens into subtokens based on the multi-character punctuation strings or emoticons;

    determining whether invalid tokens contain punctuation marks;

    if so, segmenting the tokens into subtokens according to a predetermined precedence hierarchy of punctuation;

    determining whether invalid tokens contain both alpha and numeric characters;

    if so, segmenting the tokens into subtokens at boundaries between the alpha and numeric characters in the tokens;

    submitting the subsequent segmentation to the linguistic knowledge component for validation; and

    repeating the steps of proposing a subsequent segmentation and submitting the subsequent segmentation to the linguistic knowledge component until the portion of the input string is validated or the portion of the input string has been segmented according to a predetermined number of segmentation criteria.
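
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the lexicon lookup `is_valid` merely stands in for the linguistic knowledge component, and the names `MULTI_CHAR`, `PUNCT_PRECEDENCE`, `CRITERIA`, and `tokenize`, along with the specific emoticons and punctuation order, are assumptions made for illustration only.

```python
import re
from typing import Callable, List

# Hypothetical multi-character punctuation strings and emoticons.
MULTI_CHAR = [":-)", ":-(", "...", "--"]
# Hypothetical precedence hierarchy: earlier marks are split on first.
PUNCT_PRECEDENCE = [",", ";", ":", "!", "?", ".", "-", "'"]


def split_keep(token: str, sep: str) -> List[str]:
    """Split token around sep, keeping sep as its own subtoken."""
    parts = re.split("(" + re.escape(sep) + ")", token)
    return [p for p in parts if p]


def by_multichar(token: str) -> List[str]:
    """Split on the first listed multi-character string found in token."""
    for s in MULTI_CHAR:
        if s in token and token != s:
            return split_keep(token, s)
    return [token]


def by_punctuation(token: str) -> List[str]:
    """Split on the highest-precedence punctuation mark found in token."""
    for p in PUNCT_PRECEDENCE:
        if p in token and token != p:
            return split_keep(token, p)
    return [token]


def by_alnum(token: str) -> List[str]:
    """Split at boundaries between alphabetic and numeric runs."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", token)


# The "predetermined number of segmentation criteria", in the claimed order.
CRITERIA = [by_multichar, by_punctuation, by_alnum]


def tokenize(text: str, is_valid: Callable[[str], bool]) -> List[str]:
    tokens = text.split()  # first segmentation: split at the spaces
    for criterion in CRITERIA:
        changed = True
        # Re-propose and re-validate until every token is valid
        # or this criterion can no longer split anything.
        while changed and not all(is_valid(t) for t in tokens):
            refined: List[str] = []
            for t in tokens:
                refined.extend([t] if is_valid(t) else criterion(t))
            changed = refined != tokens
            tokens = refined
    return tokens


# Toy stand-in for the linguistic knowledge component (an assumption).
LEXICON = {"hello", "world", "call", "me", "at", "5", "pm", ",", "!", ":-)"}


def is_valid(token: str) -> bool:
    return token.lower() in LEXICON


print(tokenize("hello,world:-) call me at 5pm!", is_valid))
# ['hello', ',', 'world', ':-)', 'call', 'me', 'at', '5', 'pm', '!']
```

In this sketch, tokens that already validate are never split again, which is why `:-)` survives intact even though it contains the punctuation marks `:` and `-`. Tokens that remain invalid after all criteria have been tried are returned unchanged.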
