Customized tokenization of domain specific text via rules corresponding to a speech recognition vocabulary

US 6,327,561 B1
Filed: 07/07/1999
Issued: 12/04/2001
Est. Priority Date: 07/07/1999
Status: Expired due to Term

Abstract

A method for supporting customized tokenization of domain-specific text comprises the steps of: loading domain-specific tokenization rules corresponding to the customized tokenization of the domain-specific text; tokenizing the domain-specific text using the loaded domain-specific tokenization rules; and, further tokenizing the domain-specific text using general purpose tokenization rules. The loading step of the inventive method can comprise: loading a speech recognition vocabulary; and, loading domain-specific tokenization rules corresponding to the speech recognition vocabulary. In addition, the tokenizing step can comprise identifying each substring in the domain-specific text matching a regular expression having a corresponding replacement pattern in the loaded domain-specific tokenization rules, and replacing each substring identified in the identifying step with the replacement pattern corresponding to the matched regular expression. Alternatively, the tokenizing step can comprise identifying substrings in the domain-specific text matching a regular expression having a corresponding replacement pattern in the second loaded domain-specific tokenization rules; excluding from further processing the identified substrings having a do-not-replace marker associated with the identified substring; and, replacing each non-excluded identified substring with the replacement pattern corresponding to the matched regular expression.
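
The two-pass flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the rule contents, the function names (`domain_tokenize`, `general_tokenize`, `tokenize`), and the stand-in general-purpose tokenizer are all assumptions introduced here, since the record does not list concrete rules.

```python
import re

# Hypothetical domain-specific rules (regular expression -> replacement
# pattern), invented for illustration; the record gives no concrete rules.
DOMAIN_RULES = [
    (re.compile(r"\b(\d+)\s*mg\b"), r"\1 milligrams"),
    (re.compile(r"\bb\.i\.d\."), "twice-daily"),
]

def domain_tokenize(text, rules):
    """First pass: apply each domain-specific regex/replacement pair."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

def general_tokenize(text):
    """Second pass: a stand-in general-purpose tokenizer that separates
    words (allowing internal hyphens) from punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

def tokenize(text, rules=DOMAIN_RULES):
    """Domain-specific pass first, then the general-purpose pass."""
    return general_tokenize(domain_tokenize(text, rules))

tokens = tokenize("Take 50mg b.i.d. with food.")
print(tokens)
# ['Take', '50', 'milligrams', 'twice-daily', 'with', 'food', '.']
```

Running the domain-specific pass first lets specialized forms such as "b.i.d." be rewritten before a general-purpose pass would otherwise split them at the periods.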

17 Claims

1. A method for supporting customized tokenization of a segment of domain-specific text comprising the steps of:
- loading domain-specific tokenization rules corresponding to said customized tokenization of said segment of domain-specific text;
  
  fully tokenizing said segment of domain-specific text using said loaded domain-specific tokenization rules; and,
  
  further fully tokenizing said fully tokenized segment of domain-specific text using general purpose tokenization rules.
2. The method according to claim 1, wherein said loading step comprises:
3. The method according to claim 1, wherein said loading step comprises:
- first loading an active vocabulary;
  
  identifying domain-specific tokenization rules corresponding to said active vocabulary; and,
  
  second loading said domain-specific tokenization rules identified in said identifying step.
4. The method according to claim 1, wherein said tokenizing step comprises:
- identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and,
  
  replacing each substring identified in said identifying step with said replacement pattern corresponding to said matched regular expression.
5. The method according to claim 3, wherein said tokenizing step comprises:
- checking for said second loaded domain-specific tokenization rules; and,
  
  processing said domain-specific text using said second loaded domain-specific tokenization rules only if said second loaded domain-specific tokenization rules are identified in said checking step.
6. The method according to claim 5, wherein said processing step comprises:
- identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said second loaded domain-specific tokenization rules; and,
  
  replacing each substring identified in said identifying step with said replacement pattern corresponding to said matched regular expression.
7. The method according to claim 5, wherein said processing step comprises:
- identifying substrings in said domain-specific text matching a regular expression having a corresponding replacement pattern in said second loaded domain-specific tokenization rules;
  
  excluding from further processing said identified substrings having a do-not-replace marker associated with said identified substring; and,
  
  replacing each non-excluded identified substring with said replacement pattern corresponding to said matched regular expression.
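
The identify, exclude, and replace steps recited in claims 4, 6, and 7 might be sketched as below. The claims do not say how a do-not-replace marker is represented, so this sketch assumes a hypothetical inline `~` prefix placed immediately before a protected substring; `MARKER`, `apply_rules`, and the sample rule are illustrative names, not taken from the patent.

```python
import re

# Assumption: the do-not-replace marker is a "~" character directly before
# the protected substring. The record does not define the marker's form.
MARKER = "~"

def apply_rules(text, rules):
    """Replace every regex match unless it is preceded by MARKER."""
    for pattern, replacement in rules:
        def substitute(match, replacement=replacement):
            start = match.start()
            if start > 0 and match.string[start - 1] == MARKER:
                return match.group(0)         # excluded: leave text as-is
            return match.expand(replacement)  # apply the replacement pattern
        text = pattern.sub(substitute, text)
    return text

# Hypothetical rule: expand a dosage abbreviation.
RULES = [(re.compile(r"(\d+)\s*mg\b"), r"\1 milligrams")]

result = apply_rules("Give 50mg now; label reads ~50mg.", RULES)
print(result)
# Give 50 milligrams now; label reads ~50mg.
```

Passing a function to `re.sub` keeps the exclusion check next to the replacement, so each match is either rewritten or passed through in a single scan of the text.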

8. A computer apparatus programmed with a routine set of instructions stored in a fixed medium, said computer apparatus comprising:
- means for loading domain-specific tokenization rules corresponding to a customized tokenization of a segment of domain-specific text;
  
  first means for fully tokenizing said segment of domain-specific text using said loaded domain-specific rules; and,
  
  second means for further fully tokenizing said segment of domain-specific text using general purpose tokenization rules.
9. The computer apparatus according to claim 8, wherein said loading means comprises:
10. The computer apparatus according to claim 8, wherein said loading means comprises:
- first means for loading an active vocabulary;
  
  means for identifying any domain-specific tokenization rules corresponding to said active vocabulary; and,
  
  second means for loading said domain-specific tokenization rules identified by said identifying means.
11. The computer apparatus according to claim 8, wherein said first tokenizing means comprises:
- means for identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and,
  
  means for replacing each substring identified by said identifying means with said replacement pattern corresponding to said matched regular expression.
12. The computer apparatus according to claim 10, wherein said first tokenizing means comprises:
- means for checking for said loaded domain-specific tokenization rules; and,
  
  means for processing said domain-specific text using said loaded domain-specific tokenization rules only if said loaded domain-specific tokenization rules are identified by said checking means.
13. The computer apparatus according to claim 12, wherein said processing means comprises:
- means for identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and,
  
  means for replacing each substring identified by said identifying means with said replacement pattern corresponding to said matched regular expression.
14. The computer apparatus according to claim 12, wherein said processing means comprises:
- means for identifying substrings in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules;
  
  means for excluding from further processing said identified substrings having a do-not-replace marker associated with said identified substring; and,
  
  means for replacing each non-excluded identified substring with said replacement pattern corresponding to said matched regular expression.
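
The loading and checking elements of claims 10 and 12 (and their method counterparts, claims 3 and 5) could look something like the following. This is a hedged sketch under stated assumptions: the registry `RULES_BY_VOCABULARY`, the vocabulary names, and the functions `load_rules` and `domain_pass` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical registry mapping a vocabulary name to its tokenization rules.
RULES_BY_VOCABULARY = {
    "radiology": [(r"\bCT\b", "computed tomography")],
}

def load_rules(active_vocabulary):
    """Identify the rules for the active vocabulary and load (compile) them.
    Returns an empty list when the vocabulary has no domain-specific rules."""
    raw_rules = RULES_BY_VOCABULARY.get(active_vocabulary, [])
    return [(re.compile(pattern), repl) for pattern, repl in raw_rules]

def domain_pass(text, loaded_rules):
    """Process the text with domain-specific rules only if any were loaded."""
    if not loaded_rules:  # checking step: nothing loaded, skip this pass
        return text
    for pattern, replacement in loaded_rules:
        text = pattern.sub(replacement, text)
    return text

with_rules = domain_pass("CT of the chest", load_rules("radiology"))
without_rules = domain_pass("CT of the chest", load_rules("general"))
print(with_rules)     # computed tomography of the chest
print(without_rules)  # CT of the chest
```

Keying the rules on the active vocabulary means text dictated under a vocabulary with no registered rules passes through the domain-specific stage unchanged and goes straight to general-purpose tokenization.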

15. A system for supporting customized tokenization of a segment of domain-specific text in a speech recognition system comprising:
- a loader for loading domain-specific tokenization rules corresponding to said segment of domain-specific text;
  
  a first domain-specific tokenizer for fully tokenizing said segment of domain-specific text according to said loaded domain-specific tokenization rules; and,
  
  a second general purpose tokenizer for further fully tokenizing said fully tokenized segment of text.
16. The system according to claim 15, wherein said loader comprises:
17. The system according to claim 16, wherein said first tokenizer comprises a customized tokenizer for tokenizing said domain-specific text according to said domain-specific tokenization rules database corresponding to said customized vocabulary database.

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Herzog, Martin; Grainger, Bernard John; Smith, Maria E.; Crépy, Hubert; Backfried, Gerhard
Primary Examiner(s)
Thomas, Joseph

Application Number

US09/348,516
Time in Patent Office

881 Days
Field of Search

704/9, 704/10, 704/1, 704/231, 704/251, 704/255, 704/257, 704/270, 707/531, 707/532, 707/533, 707/1, 707/5, 707/6
US Class Current

704/9
CPC Class Codes

G06F 40/284 Lexical analysis, e.g. toke...