Tokenizer for a natural language processing system

US 7,092,871 B2
Filed: 03/30/2001
Issued: 08/15/2006
Est. Priority Date: 07/20/2000
Status: Expired due to Fees

2 Assignments

0 Petitions

Abstract

The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
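The propose-and-validate cycle described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `is_valid` callback is a hypothetical stand-in for the linguistic knowledge component, the toy `LEXICON` and `split_final_period` helper are invented for the example, and the list of splitter functions stands in for the "predetermined number of segmentation criteria".

```python
from typing import Callable, List

Validator = Callable[[str], bool]       # stand-in for the linguistic knowledge component
Splitter = Callable[[str], List[str]]   # one segmentation criterion


def segment(text: str, is_valid: Validator, criteria: List[Splitter]) -> List[str]:
    """Propose whitespace tokens, then refine rejected ones criterion by criterion."""
    tokens = text.split()  # first proposal: split at spaces
    for split in criteria:  # the loop is bounded by the number of criteria
        if all(is_valid(t) for t in tokens):
            break  # every token validated, stop early
        refined: List[str] = []
        for tok in tokens:
            # Only tokens the validator rejected are re-segmented.
            refined.extend([tok] if is_valid(tok) else split(tok))
        tokens = refined
    return tokens


# Toy data, assumed for illustration only.
LEXICON = {"hello", "world", "."}


def split_final_period(tok: str) -> List[str]:
    """Split a trailing period off a token longer than one character."""
    if len(tok) > 1 and tok.endswith("."):
        return [tok[:-1], "."]
    return [tok]


result = segment("hello world.", lambda t: t.lower() in LEXICON, [split_final_period])
print(result)  # ['hello', 'world', '.']
```

Passing the validator in as a callback keeps the tokenizer engine independent of whatever lexicon or morphological analyzer performs the validation, which mirrors the separation the abstract draws between the engine and the linguistic knowledge component.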

10 Claims

1. A method of segmenting a textual input string including characters separated by spaces, comprising:
- receiving the textual input string;
  
  proposing a first segmentation of at least a portion of the input string by segmenting the input string at the spaces to obtain a plurality of tokens;
  
  attempting to validate word boundaries in the first segmentation by submitting the first segmentation to a linguistic knowledge component;
  
  if the first segmentation is not validated, proposing a subsequent segmentation by:
  
  determining whether invalid tokens contain any of a predetermined plurality of multi-character punctuation strings or emoticons;
  
  if so, segmenting the tokens into subtokens based on the multi-character punctuation strings or emoticons;
  
  determining whether invalid tokens contain punctuation marks;
  
  if so, segmenting the tokens into subtokens according to a predetermined precedence hierarchy of punctuation;
  
  determining whether invalid tokens contain both alpha and numeric characters;
  
  if so, segmenting the tokens into subtokens at boundaries between the alpha and numeric characters in the tokens;
  
  submitting the subsequent segmentation to the linguistic knowledge component for validation; and
  
  repeating the steps of proposing a subsequent segmentation and submitting the subsequent segmentation to the linguistic knowledge component until the portion of the input string is validated or the portion of the input string has been segmented according to a predetermined number of segmentation criteria.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1 wherein proposing a subsequent segmentation comprises:
    - reassembling previously segmented subtokens.
  - 3. The method of claim 1 wherein proposing a first segmentation comprises:
    - identifying a token as a group of characters flanked by spaces or either end of the input string.
  - 4. The method of claim 3 wherein proposing a subsequent segmentation comprises:
    - determining whether the token contains either all alpha characters or all numeric characters; and
      
      if so, indicating that the token cannot be validated.
  - 5. The method of claim 4 wherein proposing a subsequent segmentation comprises:
    - determining whether the token includes final punctuation; and
      
      if so, segmenting the token into a subtoken by splitting off the final punctuation.
  - 6. The method of claim 5 wherein proposing a subsequent segmentation comprises:
    - determining whether the token includes one or more edge punctuation marks; and
      
      if so, segmenting the token into subtokens by splitting off the one or more edge punctuation marks according to a predetermined edge punctuation precedence hierarchy.
  - 7. The method of claim 6 wherein proposing a subsequent segmentation comprises:
    - determining whether the token includes one or more internal punctuation marks, internal to the tokens; and
      
      if so, segmenting the token into subtokens based on the one or more internal punctuation marks according to a predetermined internal punctuation precedence hierarchy.
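Two of the splitting steps recited in claim 1 (multi-character punctuation strings or emoticons, and boundaries between alpha and numeric characters) can be sketched as follows. The `MULTI_CHAR` table and both function names are assumptions made for illustration; the claim only refers to "a predetermined plurality" of such strings and does not list them.

```python
import re
from typing import List

# Hypothetical table of multi-character punctuation strings and emoticons.
MULTI_CHAR = [":-)", ":-(", "...", "--"]

# Longest entries first so a longer string is never shadowed by a shorter one.
_MULTI_RE = re.compile(
    "(" + "|".join(re.escape(s) for s in sorted(MULTI_CHAR, key=len, reverse=True)) + ")"
)


def split_multi_char(token: str) -> List[str]:
    """Split a token around any listed multi-character string, keeping the string."""
    return [part for part in _MULTI_RE.split(token) if part]


def split_alpha_numeric(token: str) -> List[str]:
    """Split a token wherever a letter is adjacent to a digit."""
    parts: List[str] = []
    start = 0
    for i in range(1, len(token)):
        prev, cur = token[i - 1], token[i]
        if (prev.isalpha() and cur.isdigit()) or (prev.isdigit() and cur.isalpha()):
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


print(split_multi_char("great:-)"))   # ['great', ':-)']
print(split_alpha_numeric("10km"))    # ['10', 'km']
```

In the claimed method each resulting subtoken would be resubmitted for validation rather than accepted outright; these helpers only propose the split.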

8. A method of segmenting a textual input string including characters separated by spaces, comprising:
- receiving the textual input string;
  
  proposing a first segmentation of at least a portion of the input string by identifying a token as a group of characters flanked by white spaces or either end of the input string;
  
  attempting to validate word boundaries in the first segmentation by submitting the first segmentation to a linguistic knowledge component;
  
  if the first segmentation is not validated, proposing a subsequent segmentation by:
  
  determining whether invalid tokens contain any of a predetermined plurality of multi-character punctuation strings or emoticons;
  
  if so, segmenting the tokens into subtokens based on the multi-character punctuation strings or emoticons;
  
  determining whether invalid tokens contain punctuation marks;
  
  if so, segmenting the tokens into subtokens according to a predetermined precedence hierarchy of punctuation;
  
  determining whether invalid tokens contain both alpha and numeric characters;
  
  if so, segmenting the tokens into subtokens at boundaries between the alpha and numeric characters in the tokens;
  
  submitting the subsequent segmentation to the linguistic knowledge component for validation; and
  
  repeating the steps of proposing a subsequent segmentation and submitting the subsequent segmentation to the linguistic knowledge component until the portion of the input string is validated or the portion of the input string has been segmented according to a predetermined number of segmentation criteria.
- Dependent claims (9, 10):
  - 9. The method of claim 8 wherein proposing a subsequent segmentation comprises:
    - determining whether the token includes one or more edge punctuation marks; and
      
      if so, segmenting the token into subtokens by splitting off the one or more edge punctuation marks according to a predetermined edge punctuation precedence hierarchy.
  - 10. The method of claim 9 wherein proposing a subsequent segmentation comprises:
    - determining whether the token includes one or more internal punctuation marks, internal to the tokens; and
      
      if so, segmenting the token into subtokens based on the one or more internal punctuation marks according to a predetermined internal punctuation precedence hierarchy.
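One possible reading of the edge and internal punctuation precedence hierarchies in claims 9 and 10 is sketched below. The rankings in `EDGE_PRECEDENCE` and `INTERNAL_PRECEDENCE` are invented for the example (earlier entries are split first), and the rule of splitting one mark per call is an assumption; the claims do not disclose the contents of either hierarchy.

```python
from typing import List

# Hypothetical rankings; earlier entries take precedence.
EDGE_PRECEDENCE = ['"', "(", ")", ",", "."]
INTERNAL_PRECEDENCE = ["/", "-", "'"]


def split_edge(token: str) -> List[str]:
    """Split off one edge mark, preferring the higher-ranked of the two edges."""
    if len(token) < 2:
        return [token]
    candidates = []  # (rank, side) pairs; side 0 = leading, 1 = trailing
    if token[0] in EDGE_PRECEDENCE:
        candidates.append((EDGE_PRECEDENCE.index(token[0]), 0))
    if token[-1] in EDGE_PRECEDENCE:
        candidates.append((EDGE_PRECEDENCE.index(token[-1]), 1))
    if not candidates:
        return [token]
    _, side = min(candidates)  # lowest rank number wins; ties go to the leading edge
    return [token[0], token[1:]] if side == 0 else [token[:-1], token[-1]]


def split_internal(token: str) -> List[str]:
    """Split at the first occurrence of the highest-ranked internal mark."""
    for mark in INTERNAL_PRECEDENCE:
        # Search strictly inside the token so edge marks are left alone.
        i = token.find(mark, 1, len(token) - 1)
        if i != -1:
            return [token[:i], mark, token[i + 1:]]
    return [token]


print(split_edge("(hello)."))         # ['(', 'hello).']
print(split_internal("and/or-else"))  # ['and', '/', 'or-else']
```

Because each call splits only one mark, a token such as `or-else` that still fails validation would be split again on the next pass, this time at the lower-ranked hyphen.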

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Bradlee, David G.; Pentheroudakis, Joseph E.; Knoll, Sonja S.
Primary Examiner(s): Young, W. R.
Assistant Examiner(s): Wozniak, James S.
Application Number: US09/822,976
Publication Number: US 20030023425A1
Time in Patent Office: 1,964 Days
Field of Search: 704/9, 704/1, 704/4, 704/6, 704/10, 707/6
US Class Current: 704/9
CPC Class Codes:
G06F 40/226   Validation
G06F 40/268   Morphological analysis
G06F 40/284   Lexical analysis, e.g. toke...
