Customized tokenization of domain specific text via rules corresponding to a speech recognition vocabulary
First Claim
1. A method for supporting customized tokenization of a segment of domain-specific text comprising the steps of:
loading domain-specific tokenization rules corresponding to said customized tokenization of said segment of domain-specific text;
fully tokenizing said segment of domain-specific text using said loaded domain-specific tokenization rules; and
further fully tokenizing said fully tokenized segment of domain-specific text using general purpose tokenization rules.
Abstract
A method for supporting customized tokenization of domain-specific text comprises the steps of: loading domain-specific tokenization rules corresponding to the customized tokenization of the domain-specific text; tokenizing the domain-specific text using the loaded domain-specific tokenization rules; and, further tokenizing the domain-specific text using general purpose tokenization rules. The loading step of the inventive method can comprise: loading a speech recognition vocabulary; and, loading domain-specific tokenization rules corresponding to the speech recognition vocabulary. In addition, the tokenizing step can comprise identifying each substring in the domain-specific text matching a regular expression having a corresponding replacement pattern in the loaded domain-specific tokenization rules, and replacing each substring identified in the identifying step with the replacement pattern corresponding to the matched regular expression. Alternatively, the tokenizing step can comprise identifying substrings in the domain-specific text matching a regular expression having a corresponding replacement pattern in the second loaded domain-specific tokenization rules; excluding from further processing the identified substrings having a do-not-replace marker associated with the identified substring; and, replacing each non-excluded identified substring with the replacement pattern corresponding to the matched regular expression.
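The two-pass flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule set `DOMAIN_RULES`, the function names, and the word-and-digit splitter standing in for the general purpose tokenizer are all assumptions made for illustration.

```python
import re

# Hypothetical domain-specific rules: (regular expression, replacement pattern).
DOMAIN_RULES = [
    (re.compile(r"(\d+)\s*mg\b"), r"\1 milligrams"),
    (re.compile(r"\bb\.i\.d\."), "twice daily"),
]


def tokenize_domain(text, rules):
    """First pass: replace every substring that matches a rule's regex."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def tokenize_general(text):
    """Second pass: an assumed stand-in general purpose tokenizer
    that simply extracts runs of letters and runs of digits."""
    return re.findall(r"[A-Za-z]+|\d+", text)


def tokenize(text, rules=DOMAIN_RULES):
    """Apply the domain-specific pass, then the general purpose pass."""
    return tokenize_general(tokenize_domain(text, rules))


if __name__ == "__main__":
    print(tokenize("Take 50mg b.i.d."))
```

Running the domain-specific pass first lets the specialized rules see the raw text before the general pass splits it into tokens.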
17 Claims
1. A method for supporting customized tokenization of a segment of domain-specific text comprising the steps of:
loading domain-specific tokenization rules corresponding to said customized tokenization of said segment of domain-specific text;
fully tokenizing said segment of domain-specific text using said loaded domain-specific tokenization rules; and
further fully tokenizing said fully tokenized segment of domain-specific text using general purpose tokenization rules.
2. The method according to claim 1, wherein said loading step comprises:
loading a speech recognition vocabulary; and
loading domain-specific tokenization rules corresponding to said speech recognition vocabulary.
3. The method according to claim 1, wherein said loading step comprises:
first loading an active vocabulary;
identifying domain-specific tokenization rules corresponding to said active vocabulary; and
second loading said domain-specific tokenization rules identified in said identifying step.
4. The method according to claim 1, wherein said tokenizing step comprises:
identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and
replacing each substring identified in said identifying step with said replacement pattern corresponding to said matched regular expression.
5. The method according to claim 3, wherein said tokenizing step comprises:
checking for said second loaded domain-specific tokenization rules; and
processing said domain-specific text using said second loaded domain-specific tokenization rules only if said second loaded domain-specific tokenization rules are identified in said checking step.
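As an illustration of claims 3 and 5, the sketch below looks up rules by the active vocabulary and applies them only if any were found. The registry `RULES_BY_VOCABULARY`, the vocabulary name, and both function names are hypothetical; the claims do not specify how rules are stored or associated with a vocabulary.

```python
import re

# Hypothetical registry mapping a vocabulary name to its rule set.
RULES_BY_VOCABULARY = {
    "radiology": [(r"\bcm\b", "centimeters")],
}


def load_rules(active_vocabulary):
    """Identify and load rules for the active vocabulary.

    Returns a list of (compiled regex, replacement) pairs,
    or None when the vocabulary has no domain-specific rules.
    """
    raw_rules = RULES_BY_VOCABULARY.get(active_vocabulary)
    if raw_rules is None:
        return None
    return [(re.compile(pattern), repl) for pattern, repl in raw_rules]


def apply_domain_rules(text, rules):
    """Process the text with domain rules only if rules were loaded."""
    if not rules:
        return text  # nothing loaded: leave the text for the general pass
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

For example, `apply_domain_rules("a 3 cm mass", load_rules("radiology"))` rewrites `cm`, while an unknown vocabulary name leaves the text unchanged.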
6. The method according to claim 5, wherein said processing step comprises:
identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said second loaded domain-specific tokenization rules; and
replacing each substring identified in said identifying step with said replacement pattern corresponding to said matched regular expression.
7. The method according to claim 5, wherein said processing step comprises:
identifying substrings in said domain-specific text matching a regular expression having a corresponding replacement pattern in said second loaded domain-specific tokenization rules;
excluding from further processing said identified substrings having a do-not-replace marker associated with said identified substring; and
replacing each non-excluded identified substring with said replacement pattern corresponding to said matched regular expression.
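The exclusion step of claim 7 might be sketched as follows for a single rule. The claim does not say how a do-not-replace marker is represented; this sketch assumes, purely for illustration, that markers arrive as a list of `(start, end)` character spans in the original text. The function name `replace_unmarked` is likewise hypothetical.

```python
import re


def replace_unmarked(text, pattern, replacement, protected_spans):
    """Replace matches of `pattern`, skipping any match that overlaps
    a span carrying the (assumed) do-not-replace marker."""

    def substitute(match):
        start, end = match.span()
        for p_start, p_end in protected_spans:
            if start < p_end and p_start < end:
                return match.group(0)  # excluded: keep the original substring
        return match.expand(replacement)  # non-excluded: apply the replacement

    return re.sub(pattern, substitute, text)


if __name__ == "__main__":
    sentence = "Dr. Smith lives on Oak Dr."
    street = sentence.rindex("Dr.")  # protect the street abbreviation
    print(replace_unmarked(sentence, r"\bDr\.", "Doctor",
                           [(street, street + 3)]))
```

Because the spans refer to positions in the original text, this sketch handles one rule per call; applying several rules in sequence would require adjusting the spans after each replacement.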
8. A computer apparatus programmed with a routine set of instructions stored in a fixed medium, said computer apparatus comprising:
means for loading domain-specific tokenization rules corresponding to a customized tokenization of a segment of domain-specific text;
first means for fully tokenizing said segment of domain-specific text using said loaded domain-specific rules; and
second means for further fully tokenizing said segment of domain-specific text using general purpose tokenization rules.
9. The computer apparatus according to claim 8, wherein said loading means comprises:
first means for loading a speech recognition vocabulary; and
second means for loading domain-specific tokenization rules corresponding to said speech recognition vocabulary.
10. The computer apparatus according to claim 8, wherein said loading means comprises:
first means for loading an active vocabulary;
means for identifying any domain-specific tokenization rules corresponding to said active vocabulary; and
second means for loading said domain-specific tokenization rules identified by said identifying means.
11. The computer apparatus according to claim 8, wherein said first tokenizing means comprises:
means for identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and
means for replacing each substring identified by said identifying means with said replacement pattern corresponding to said matched regular expression.
12. The computer apparatus according to claim 10, wherein said first tokenizing means comprises:
means for checking for said loaded domain-specific tokenization rules; and
means for processing said domain-specific text using said loaded domain-specific tokenization rules only if said loaded domain-specific tokenization rules are identified by said checking means.
13. The computer apparatus according to claim 12, wherein said processing means comprises:
means for identifying each substring in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules; and
means for replacing each substring identified by said identifying means with said replacement pattern corresponding to said matched regular expression.
14. The computer apparatus according to claim 12, wherein said processing means comprises:
means for identifying substrings in said domain-specific text matching a regular expression having a corresponding replacement pattern in said loaded domain-specific tokenization rules;
means for excluding from further processing said identified substrings having a do-not-replace marker associated with said identified substring; and
means for replacing each non-excluded identified substring with said replacement pattern corresponding to said matched regular expression.
15. A system for supporting customized tokenization of a segment of domain-specific text in a speech recognition system comprising:
a loader for loading domain-specific tokenization rules corresponding to said segment of domain-specific text;
a first domain-specific tokenizer for fully tokenizing said segment of domain-specific text according to said loaded domain-specific tokenization rules; and
a second general purpose tokenizer for further fully tokenizing said fully tokenized segment of text.
16. The system according to claim 15, wherein said loader comprises:
a vocabulary loader for loading a customized vocabulary database; and
a rule loader for loading a domain-specific tokenization rules database corresponding to said customized vocabulary database.
17. The system according to claim 16, wherein said first tokenizer comprises a customized tokenizer for tokenizing said domain-specific text according to said domain-specific tokenization rules database corresponding to said customized vocabulary database.
Specification