Tokenizer for a natural language processing system
Abstract
The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
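The propose-and-validate interplay described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the `LEXICON` set is a hypothetical stand-in for the linguistic knowledge component, and `propose_segmentations`, `is_valid`, and `segment` are names invented for this sketch.

```python
import re
from typing import Iterator, List, Optional

# Hypothetical stand-in for the linguistic knowledge component.
LEXICON = {"stop", "now", ",", "don't"}


def is_valid(token: str) -> bool:
    """Return True if the token, standing alone, is a known unit."""
    return token.lower() in LEXICON


def propose_segmentations(text: str) -> Iterator[List[str]]:
    """Yield candidate segmentations, coarsest first (assumed order)."""
    # First proposal: character groups separated by white space.
    yield text.split()
    # Second proposal: also separate punctuation from the words it touches,
    # while keeping word-internal apostrophes and hyphens attached.
    yield re.findall(r"\w[\w'-]*\w|\w|[^\w\s]", text)


def segment(text: str) -> Optional[List[str]]:
    """Return the first proposal whose tokens all validate, else None."""
    for proposal in propose_segmentations(text):
        if all(is_valid(token) for token in proposal):
            return proposal
    return None


print(segment("Stop, now"))
```

Here the whitespace-only proposal fails because `Stop,` is not in the toy lexicon, so the engine falls back to the finer proposal that separates the comma.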
20 Claims
1. A method in a natural language processing system of segmenting a textual input string including a plurality of characters arranged in character groups separated by white spaces, the method comprising:
- receiving the input string;
- segmenting the input string into a plurality of proposed tokens, by accessing segmentation criteria arranged in a predetermined hierarchy of segmentation criteria, and segmenting based on the segmentation criteria in an order based on the hierarchy, wherein accessing segmentation criteria includes accessing a precedence hierarchy of punctuation in the language-specific data, the precedence hierarchy being arranged based on binding properties of the punctuation in the precedence hierarchy, and segmenting the input string based on the punctuation in an order based on the precedence hierarchy;
- after segmenting, validating the proposed tokens by submitting each of the proposed tokens to a linguistic knowledge component to determine whether each of the proposed tokens, standing alone, represents a linguistically meaningful unit; and
- repeating the steps of segmenting the input string into one or more different proposed tokens, different from the previously proposed tokens, and thereafter validating the different proposed tokens, if each of the previously proposed tokens does not represent a linguistically meaningful unit.
Dependent claims: 2, 3, 4, 5
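One way to picture segmenting "in an order based on the precedence hierarchy" is the sketch below. It assumes the hierarchy is a simple list ordered from weakest-binding to strongest-binding punctuation; the particular ordering in `PRECEDENCE`, the toy `LEXICON`, and the helper names are all assumptions made for illustration, since the claim does not enumerate them.

```python
from typing import Callable, List

# Assumed ordering, weakest-binding punctuation first. The claim does not
# list the actual hierarchy; this one is purely illustrative.
PRECEDENCE = [",", ";", ":", "/", "-", "'"]

# Hypothetical stand-in for the linguistic knowledge component.
LEXICON = {"well-known", "red", "green", "don't"}


def is_valid(token: str) -> bool:
    return token.lower() in LEXICON


def split_keep(token: str, mark: str) -> List[str]:
    """Split token on mark, keeping each mark as its own piece."""
    pieces: List[str] = []
    for i, part in enumerate(token.split(mark)):
        if i:
            pieces.append(mark)
        if part:
            pieces.append(part)
    return pieces


def segment_token(token: str, valid: Callable[[str], bool]) -> List[str]:
    """Re-segment an invalid token at the weakest-binding punctuation first."""
    if valid(token):
        return [token]
    for mark in PRECEDENCE:
        if mark in token and token != mark:
            out: List[str] = []
            for piece in split_keep(token, mark):
                # Punctuation pieces are kept as-is; other pieces are retried.
                out.extend([piece] if piece == mark else segment_token(piece, valid))
            return out
    return [token]  # nothing left to split on; left unvalidated


print(segment_token("well-known,", is_valid))
print(segment_token("red/green", is_valid))
```

In this sketch the trailing comma is peeled off before the hyphen is ever considered, so `well-known` survives as a single token once it validates.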
6. A segmenter segmenting a textual input string, containing characters, into linguistically meaningful units, the segmenter comprising:
- a data store storing language specific data indicative of a precedence hierarchy of punctuation arranged based on binding properties of the punctuation;
- a linguistic knowledge component configured to validate a token as a linguistically meaningful unit; and
- an engine coupled to the data store and the linguistic knowledge component and configured to receive the input string, access the language specific data in the data store, segment the input string into one or more proposed tokens based at least in part on the precedence hierarchy of punctuation, submit the one or more proposed tokens to the linguistic knowledge component for validation, and if the linguistic knowledge component is unable to validate the one or more proposed tokens, repeat segmenting the input string into one or more different proposed tokens and thereafter submitting the different proposed tokens to the linguistic knowledge component to validate the different proposed tokens.
Dependent claims: 7
8. A method of segmenting a textual input string including characters separated by spaces, comprising:
- receiving the textual input string;
- proposing a first segmentation of at least a portion of the input string into a plurality of proposed tokens;
- attempting to validate the proposed tokens in the first segmentation by submitting the first segmentation to a linguistic knowledge component; and
- if the first segmentation is not validated, proposing a subsequent segmentation into a different plurality of proposed tokens, different from the first segmentation, and thereafter submitting the subsequent segmentation to the linguistic knowledge component for validation;
- wherein proposing a subsequent segmentation comprises determining whether invalid proposed tokens contain both alpha and numeric characters, and if so, segmenting the invalid proposed tokens into subtokens at boundaries between the alpha and numeric characters in the tokens based on a predetermined precedence hierarchy of characters.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
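The alpha/numeric re-segmentation step recited in claim 8 could look roughly like the sketch below. It is a simplification under stated assumptions: the "predetermined precedence hierarchy of characters" is not modeled, the `LEXICON` set is a hypothetical stand-in for the linguistic knowledge component, and the function names are invented for this example. It relies on `re.split` accepting zero-width matches, which requires Python 3.7 or later.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for the linguistic knowledge component.
LEXICON = {"ran", "25", "km", "win", "95"}

# Zero-width boundary between a digit and a letter, in either order.
ALNUM_BOUNDARY = re.compile(r"(?<=\d)(?=[^\W\d_])|(?<=[^\W\d_])(?=\d)")


def is_valid(token: str) -> bool:
    return token.lower() in LEXICON


def has_alpha_and_digit(token: str) -> bool:
    """True if the token mixes letters and digits."""
    return any(c.isalpha() for c in token) and any(c.isdigit() for c in token)


def resegment(tokens: List[str], valid: Callable[[str], bool]) -> List[str]:
    """Split only the invalid tokens that mix alpha and numeric characters."""
    out: List[str] = []
    for token in tokens:
        if not valid(token) and has_alpha_and_digit(token):
            out.extend(ALNUM_BOUNDARY.split(token))
        else:
            out.append(token)
    return out


print(resegment(["ran", "25km"], is_valid))
```

The resulting subtokens would then be submitted again for validation, as the claim describes; that second validation pass is omitted here for brevity.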
Specification