Method for language-independent text tokenization using a character categorization

  • US 4,991,094 A
  • Filed: 04/26/1989
  • Issued: 02/05/1991
  • Est. Priority Date: 04/26/1989
  • Status: Expired due to Fees
First Claim

1. A computer method for isolating words from an input stream of characters forming natural language text, the method comprising the steps of:

    storing in a data processor a character classification table which describes each character in a character code as being either a delimiter character, an alpha/numeric character or a conditional delimiter character which assumes a function of a delimiter character when it occurs in predefined character contexts;

    inputting to said data processor an input stream of characters which are members of said character code, the input stream forming natural language text;

    building in said data processor a string of alpha/numeric characters to form a word from said input stream of characters produced by said inputting step;

    isolating in said data processor three consecutive characters from said inputting step of said input stream, as a previous character, a current character and a next character;

    accessing in said data processor said character classification table in response to said isolating step, to determine if said current character is a delimiter character, an alpha/numeric character or a conditional delimiter character;

    appending in said data processor said current character to said string when said character classification table identifies it as an alpha/numeric character in said accessing step;

    signaling an output signal from said data processor that said string is a complete word when said character classification table identifies said current character as a delimiter character in said accessing step;

    analyzing in said data processor said previous character, said current character and said next character to determine if said current character assumes the function of a delimiter character when said character classification table identifies said current character as a conditional delimiter character in said accessing step.
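
The steps of claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim, not the patented implementation. The names `CharClass`, `CONDITIONAL_CHARS`, `build_table`, `acts_as_delimiter` and `tokenize` are hypothetical, the 256-entry table is an assumed character code, and the context rule (a conditional delimiter stays inside a word only when it has an alpha/numeric character on both sides) is an assumption, since the claim only refers to "predefined character contexts".

```python
from enum import Enum


class CharClass(Enum):
    DELIMITER = 1
    ALNUM = 2
    CONDITIONAL = 3


# Assumption: which characters count as conditional delimiters.
CONDITIONAL_CHARS = {".", ",", "'", "-"}


def build_table():
    """Build a classification table for an assumed 256-entry character code."""
    table = {}
    for code in range(256):
        ch = chr(code)
        if ch in CONDITIONAL_CHARS:
            table[ch] = CharClass.CONDITIONAL
        elif ch.isalnum():
            table[ch] = CharClass.ALNUM
        else:
            table[ch] = CharClass.DELIMITER
    return table


TABLE = build_table()


def lookup(ch):
    # Characters outside the table are treated as delimiters (assumption).
    return TABLE.get(ch, CharClass.DELIMITER)


def acts_as_delimiter(prev, nxt):
    """Assumed context rule: the conditional delimiter joins a word only
    when both neighbours are alpha/numeric (as in "isn't" or "3.14")."""
    both_alnum = (lookup(prev) is CharClass.ALNUM
                  and lookup(nxt) is CharClass.ALNUM)
    return not both_alnum


def tokenize(text):
    words = []
    current_word = []
    for i, cur in enumerate(text):
        # Isolate previous, current and next characters; a space stands in
        # for the missing neighbour at either end of the stream.
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "

        cls = lookup(cur)
        if cls is CharClass.CONDITIONAL:
            # Resolve the conditional delimiter from its context.
            cls = (CharClass.DELIMITER if acts_as_delimiter(prev, nxt)
                   else CharClass.ALNUM)

        if cls is CharClass.ALNUM:
            current_word.append(cur)
        elif current_word:
            # A delimiter signals that the string is a complete word.
            words.append("".join(current_word))
            current_word = []

    if current_word:  # flush the last word at end of input
        words.append("".join(current_word))
    return words


print(tokenize("It's 3.14, isn't it? End."))
# → ["It's", '3.14', "isn't", 'it', 'End']
```

In this sketch a resolved conditional delimiter that does not act as a delimiter is simply appended to the word, which is one possible reading; the claim itself only states that the context is analyzed to decide whether the character assumes the delimiter function.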
