Tokenizer for a natural language processing system

  • US 7,269,547 B2
  • Filed: 07/15/2005
  • Issued: 09/11/2007
  • Est. Priority Date: 07/20/2000
  • Status: Expired due to Fees
First Claim

1. A method in a natural language processing system of segmenting a textual input string including a plurality of characters arranged in character groups separated by white spaces, the method comprising:

  • receiving the input string;

  • segmenting the input string into a plurality of proposed tokens, by accessing segmentation criteria arranged in a predetermined hierarchy of segmentation criteria, and segmenting based on the segmentation criteria in an order based on the hierarchy, wherein accessing segmentation criteria includes accessing a precedence hierarchy of punctuation in the language-specific data, the precedence hierarchy being arranged based on binding properties of the punctuation in the precedence hierarchy, and segmenting the input string based on the punctuation in an order based on the precedence hierarchy;

  • after segmenting, validating the proposed tokens by submitting each of the proposed tokens to a linguistic knowledge component to determine whether each of the proposed tokens, standing alone, represents a linguistically meaningful unit; and

  • repeating the steps of segmenting the input string into one or more different proposed tokens, different from the previously proposed tokens, and thereafter validating the different proposed tokens, if each of the previously proposed tokens does not represent a linguistically meaningful unit.
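
The segment, validate, and repeat loop recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The punctuation order in `PUNCT_HIERARCHY`, the toy `LEXICON` standing in for the linguistic knowledge component, and all function names are assumptions made only for illustration. The sketch keeps each white-space-separated character group whole if it validates, and otherwise splits it on one punctuation mark at a time, in hierarchy order, re-validating after each split.

```python
import re

# Hypothetical precedence hierarchy: marks listed first are assumed to bind
# least tightly to neighboring characters, so they are split off first.
PUNCT_HIERARCHY = [",", ";", ":", "/", "-", "."]

# Toy stand-in for the linguistic knowledge component (an assumption).
LEXICON = {"cats", "dogs", "e.g.", "well-known"}


def is_meaningful(token):
    """Return True if the token, standing alone, is accepted as a unit."""
    return token in PUNCT_HIERARCHY or token.lower() in LEXICON


def split_on(token, punct):
    """Split a token around one punctuation mark, keeping the mark itself."""
    parts = re.split("(" + re.escape(punct) + ")", token)
    return [p for p in parts if p]


def segment_group(group):
    """Propose tokens for one character group, re-segmenting until valid."""
    proposed = [group]
    for punct in PUNCT_HIERARCHY:
        if all(is_meaningful(t) for t in proposed):
            break  # every proposed token validated; stop re-segmenting
        refined = []
        for tok in proposed:
            if is_meaningful(tok):
                refined.append(tok)  # already valid, leave it intact
            else:
                refined.extend(split_on(tok, punct))
        proposed = refined
    return proposed


def tokenize(text):
    """Segment an input string group by group, in left-to-right order."""
    tokens = []
    for group in text.split():
        tokens.extend(segment_group(group))
    return tokens


print(tokenize("cats/dogs, e.g. well-known"))
```

With this toy lexicon the call prints `['cats', '/', 'dogs', ',', 'e.g.', 'well-known']`: the groups `e.g.` and `well-known` validate as whole units and are never split, while `cats/dogs,` fails validation and is re-segmented first on the comma and then on the slash.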
