Tokenizer for a natural language processing system

US 7,269,547 B2
Filed: 07/15/2005
Issued: 09/11/2007
Est. Priority Date: 07/20/2000
Status: Expired due to Fees


Abstract

The present invention is a segmenter used in a natural language processing system. The segmenter segments a textual input string into tokens for further natural language processing. In accordance with one feature of the invention, the segmenter includes a tokenizer engine that proposes segmentations and submits them to a linguistic knowledge component for validation. In accordance with another feature of the invention, the segmentation system includes language-specific data that contains a precedence hierarchy for punctuation. If proposed tokens in the input string contain punctuation, they can illustratively be broken into subtokens based on the precedence hierarchy.
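
The propose-and-validate idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy `LEXICON`, the `is_valid` check, and the `segment` function are hypothetical names, and a set lookup stands in for the linguistic knowledge component.

```python
import re
from typing import List

# Hypothetical toy lexicon standing in for the linguistic knowledge component.
LEXICON = {"hello", "world", ",", "!"}


def is_valid(token: str) -> bool:
    """Return True if the token, standing alone, is known to the toy lexicon."""
    return token.lower() in LEXICON


def segment(text: str) -> List[str]:
    """Propose whitespace tokens first, then re-segment the ones that fail."""
    result: List[str] = []
    for token in text.split():
        if is_valid(token):
            result.append(token)
            continue
        # Propose a different segmentation: words and single punctuation marks.
        subtokens = re.findall(r"\w+|[^\w\s]", token)
        if subtokens and all(is_valid(s) for s in subtokens):
            result.extend(subtokens)
        else:
            # No proposal validated; keep the original token unsegmented.
            result.append(token)
    return result


print(segment("Hello, world!"))
```

In this sketch a token such as `world!` fails the first lookup, is re-proposed as `world` and `!`, and is accepted only because both subtokens validate; `xyz?` would be left intact because `xyz` is not in the toy lexicon.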

20 Claims

1. A method in a natural language processing system of segmenting a textual input string including a plurality of characters arranged in character groups separated by white spaces, the method comprising:
- receiving the input string;
  
  segmenting the input string into a plurality of proposed tokens, by accessing segmentation criteria arranged in a predetermined hierarchy of segmentation criteria, and segmenting based on the segmentation criteria in an order based on the hierarchy, wherein accessing segmentation criteria includes accessing a precedence hierarchy of punctuation in the language-specific data, the precedence hierarchy being arranged based on binding properties of the punctuation in the precedence hierarchy, and segmenting the input string based on the punctuation in an order based on the precedence hierarchy;
  
  after segmenting, validating the proposed tokens by submitting each of the proposed tokens to a linguistic knowledge component to determine whether each of the proposed tokens, standing alone, represents a linguistically meaningful unit; and
  
  repeating the steps of segmenting the input string into one or more different proposed tokens, different from the previously proposed tokens, and thereafter validating the different proposed tokens, if each of the previously proposed tokens does not represent a linguistically meaningful unit.
  - 2. The method of claim 1 wherein segmenting according to the hierarchy of segmentation criteria comprises:
    - accessing language specific data containing a portion of the segmentation criteria.
  - 3. The method of claim 1 and further comprising:
    - repeating the steps of validating and segmenting until all characters in the input string have been validated or until the predetermined hierarchy of segmentation criteria has been exhausted.
  - 4. The method of claim 1 wherein the linguistic knowledge component includes a lexicon and wherein validating comprises:
    - accessing the lexicon to determine whether it contains the proposed tokens.
  - 5. The method of claim 4 wherein the linguistic knowledge component includes a morphological analyzer and wherein validating comprises:
    - invoking the morphological analyzer to convert a form of the proposed tokens to a morphologically different form; and
      
      accessing the lexicon to determine whether it contains the morphologically different form of the token.
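
Claims 4 and 5 describe validation as a lexicon lookup with a morphological fallback. The following is a rough sketch under stated assumptions: `LEXICON`, `morphological_variants`, and `validate` are hypothetical names, and the naive English suffix stripper is only a placeholder for a real morphological analyzer.

```python
from typing import List

# Hypothetical toy lexicon of base forms.
LEXICON = {"walk", "dog", "run"}


def morphological_variants(token: str) -> List[str]:
    """Naive suffix stripping; a placeholder for a morphological analyzer."""
    lowered = token.lower()
    variants = []
    for suffix in ("ing", "ed", "s"):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            variants.append(lowered[: -len(suffix)])
    return variants


def validate(token: str) -> bool:
    """Check the lexicon directly, then retry with morphologically different forms."""
    if token.lower() in LEXICON:
        return True
    return any(variant in LEXICON for variant in morphological_variants(token))


print(validate("Dogs"), validate("walked"), validate("xyzzy"))
```

The direct lookup corresponds to claim 4; the retry with a converted form corresponds to claim 5.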

6. A segmenter segmenting a textual input string, containing characters, into linguistically meaningful units, the segmenter comprising:
- a data store storing language specific data indicative of a precedence hierarchy of punctuation arranged based on binding properties of the punctuation;
  
  a linguistic knowledge component configured to validate a token as a linguistically meaningful unit; and
  
  an engine coupled to the data store and the linguistic knowledge component and configured to receive the input string, access the language specific data in the data store, segment the input string into one or more proposed tokens based at least in part on the precedence hierarchy of punctuation, submit the one or more proposed tokens to the linguistic knowledge component for validation, and if the linguistic knowledge component is unable to validate the one or more proposed tokens, repeat segmenting the input string into one or more different proposed tokens and thereafter submitting the different proposed tokens to the linguistic knowledge component to validate the different proposed tokens.
  - 7. The segmenter of claim 6 wherein the engine is further configured to repeatedly segment the input stream into one or more different proposed tokens based on a predetermined hierarchy of segmentation criteria, and resubmit the one or more different proposed tokens to the linguistic knowledge component until the segmentation criteria are exhausted or all the characters in the input string are validated.
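
One way to picture the punctuation precedence hierarchy of claims 6 and 7 is an ordered list of marks that the engine works through until a token validates or the list is exhausted. The sketch below is an assumption-laden illustration: the ordering in `PRECEDENCE` is invented for the example (the claims do not list the actual hierarchy), and `LEXICON`, `split_keep`, and `segment_token` are hypothetical names.

```python
from typing import List

# Hypothetical ordering: loosely binding marks are tried first, tightly
# binding marks (hyphen, apostrophe) last.
PRECEDENCE = [",", "/", "-", "'"]

# Hypothetical toy lexicon standing in for the linguistic knowledge component.
LEXICON = {"and", "or", "well-known", "don't", ","}


def split_keep(token: str, mark: str) -> List[str]:
    """Split token at mark, keeping each occurrence of mark as its own subtoken."""
    parts: List[str] = []
    for i, piece in enumerate(token.split(mark)):
        if i > 0:
            parts.append(mark)
        if piece:
            parts.append(piece)
    return parts


def segment_token(token: str, level: int = 0) -> List[str]:
    """Stop when the token validates or the hierarchy is exhausted."""
    if token.lower() in LEXICON or level >= len(PRECEDENCE):
        return [token]
    mark = PRECEDENCE[level]
    if mark not in token:
        return segment_token(token, level + 1)
    result: List[str] = []
    for sub in split_keep(token, mark):
        result.extend([sub] if sub == mark else segment_token(sub, level + 1))
    return result


print(segment_token("and/or"))
print(segment_token("well-known,"))
```

Because validation happens before each split, `well-known,` loses only its comma here: once `well-known` validates, the hyphen is never used as a split point.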

8. A method of segmenting a textual input string including characters separated by spaces, comprising:
- receiving the textual input string;
  
  proposing a first segmentation of at least a portion of the input string into a plurality of proposed tokens;
  
  attempting to validate the proposed tokens in the first segmentation by submitting the first segmentation to a linguistic knowledge component; and
  
  if the first segmentation is not validated, proposing a subsequent segmentation into a different plurality of proposed tokens, different from the first segmentation, and thereafter submitting the subsequent segmentation to the linguistic knowledge component for validation;
  
  wherein proposing a subsequent segmentation comprises determining whether invalid proposed tokens contain both alpha and numeric characters, and if so, segmenting the invalid proposed tokens into subtokens at boundaries between the alpha and numeric characters in the tokens based on a predetermined precedence hierarchy of characters.
  - 9. The method of claim 8 and further comprising:
    - repeating the steps of proposing a subsequent segmentation and submitting the subsequent segmentation to the linguistic knowledge component until the portion of the input string is validated or the portion of the input string has been segmented according to a predetermined number of segmentation criteria.
  - 10. The method of claim 9 wherein proposing a first segmentation comprises:
    - segmenting the input string at the spaces to obtain the plurality of proposed tokens.
  - 11. The method of claim 10 wherein proposing a subsequent segmentation comprises:
    - determining whether invalid proposed tokens contain any of a predetermined plurality of multi-character punctuation strings or emoticons; and
      
      if so, segmenting the invalid proposed tokens into subtokens based on the multi-character punctuation strings or emoticons.
  - 12. The method of claim 11 wherein proposing a subsequent segmentation comprises:
    - determining whether invalid proposed tokens contain punctuation marks; and
      
      if so, segmenting the invalid proposed tokens into subtokens according to a predetermined precedence hierarchy of punctuation.
  - 13. The method of claim 8 wherein proposing a subsequent segmentation comprises:
    - reassembling previously segmented subtokens.
  - 14. The method of claim 9 wherein proposing a first segmentation comprises:
    - identifying a proposed token as a group of characters flanked by spaces or either end of the input string.
  - 15. The method of claim 14 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token contains either all alpha characters or all numeric characters; and
      
      if so, indicating that the proposed token cannot be validated.
  - 16. The method of claim 15 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token includes final punctuation; and
      
      if so, segmenting the proposed token into a subtoken by splitting off the final punctuation.
  - 17. The method of claim 16 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token includes both alpha and numeric characters; and
      
      if so, segmenting the proposed token into subtokens at a boundary between the alpha and numeric characters.
  - 18. The method of claim 17 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token includes one or more of a predetermined set of multi-punctuation characters or emoticons; and
      
      if so, segmenting the proposed token into subtokens based on the multi-punctuation characters or emoticons included in the token.
  - 19. The method of claim 18 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token includes one or more edge punctuation marks; and
      
      if so, segmenting the proposed token into subtokens by splitting off the one or more edge punctuation marks according to a predetermined edge punctuation precedence hierarchy.
  - 20. The method of claim 19 wherein proposing a subsequent segmentation comprises:
    - determining whether the proposed token includes one or more internal punctuation marks, internal to the tokens; and
      
      if so, segmenting the proposed token into subtokens based on the one or more internal punctuation marks according to a predetermined internal punctuation precedence hierarchy.
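
Two of the re-segmentation criteria recited in claims 8 through 20, splitting at alpha/numeric boundaries and splitting around multi-character punctuation strings or emoticons, can be sketched with regular expressions. This is only an illustration: the `MULTI_CHAR` list and all function names are hypothetical, and the patterns are limited to ASCII letters and digits for simplicity.

```python
import re
from typing import List

# Hypothetical set of multi-character punctuation strings and emoticons.
MULTI_CHAR = [":-)", ":)", "...", "--"]

# Longer strings come first so they are preferred over shorter alternatives.
_MULTI_RE = re.compile(
    "(" + "|".join(re.escape(m) for m in sorted(MULTI_CHAR, key=len, reverse=True)) + ")"
)


def has_alpha_and_numeric(token: str) -> bool:
    """True if the token mixes ASCII letters and digits."""
    return bool(re.search(r"[A-Za-z]", token)) and bool(re.search(r"[0-9]", token))


def split_alpha_numeric(token: str) -> List[str]:
    """Split at every boundary between letter runs, digit runs, and other runs."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", token)


def split_multi_char(token: str) -> List[str]:
    """Split off known multi-character strings, keeping them as subtokens."""
    return [part for part in _MULTI_RE.split(token) if part]


print(split_alpha_numeric("25km"))
print(split_multi_char("great:-)"))
```

In a fuller pipeline each of these functions would be one stage in the hierarchy, applied only to tokens that failed validation, with the resulting subtokens resubmitted for validation.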

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Bradlee, David G., Pentheroudakis, Joseph E., Knoll, Sonja S.
Primary Examiner(s): Edouard; Patrick N.
Assistant Examiner(s): Wozniak; James S.
Application Number: US11/182,477
Publication Number: US 20050251381A1
Time in Patent Office: 788 Days
Field of Search: 704/1, 704/4, 704/6, 704/9, 704/10
US Class Current: 704/9
CPC Class Codes:
G06F 40/226   Validation
G06F 40/268   Morphological analysis
G06F 40/284   Lexical analysis, e.g. toke...
