System for Chinese tokenization and named entity recognition
Abstract
A system (100, 200) for tokenization and named entity recognition of ideographic language is disclosed. In the system, a word lattice is generated for a string of ideographic characters using finite state grammars (150) and a system lexicon (240). Segmented text is generated by determining word boundaries in the string of ideographic characters using the word lattice dependent upon a contextual language model (152A) and one or more entity language models (152B). One or more named entities are recognized in the string of ideographic characters using the word lattice dependent upon the contextual language model (152A) and the one or more entity language models (152B). The contextual language model (152A) and the one or more entity language models (152B) are each class-based language models. The lexicon (240) includes single ideographic characters, words, and predetermined features of the characters and words.
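The pipeline described in the abstract can be illustrated with a small Python sketch. It is not the patented implementation: the toy lexicon, the surname-based person-name rule standing in for a finite state grammar, the class labels (`W`, `LOC`, `PER`), and every log-probability are assumptions made up for the example. The sketch builds a word lattice from lexicon lookups plus grammar-generated name hypotheses, then picks the best path with a class-based transition score (playing the role of the contextual model) and per-class emission scores (playing the role of the entity models).

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> (class, log-probability).
# It includes single characters as well as multi-character words.
LEXICON = {
    "王": ("W", -7.0), "小": ("W", -6.0), "明": ("W", -7.0),
    "在": ("W", -3.0), "北": ("W", -7.0), "京": ("W", -8.0),
    "北京": ("LOC", -4.0),
}
SURNAMES = {"王"}   # assumed lexicon feature: character can start a person name
MAX_WORD_LEN = 4    # assumed upper bound on lexicon entry length

# Assumed class-bigram log-probabilities (contextual model); others default.
TRANS = {("PER", "W"): -0.5, ("W", "LOC"): -0.5}


def build_lattice(text):
    """Return lattice edges (start, end, class) over character positions."""
    edges, n = [], len(text)
    for i in range(n):
        # Lexicon edges: every substring starting at i that is an entry.
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            if text[i:j] in LEXICON:
                edges.append((i, j, LEXICON[text[i:j]][0]))
        # Fallback so unknown characters never disconnect the lattice.
        if text[i] not in LEXICON:
            edges.append((i, i + 1, "W"))
        # Grammar edges: PERSON := SURNAME followed by 1 or 2 characters.
        if text[i] in SURNAMES:
            for given_len in (1, 2):
                if i + 1 + given_len <= n:
                    edges.append((i, i + 1 + given_len, "PER"))
    return edges


def emission(surface, cls):
    """Score a surface form under its class (toy entity model)."""
    if cls == "PER":
        return -5.0 - 2.0 * (len(surface) - 1)
    return LEXICON.get(surface, (cls, -12.0))[1]


def transition(prev_cls, cls):
    """Score a class following another class (toy contextual model)."""
    return TRANS.get((prev_cls, cls), -1.0)


def decode(text, edges):
    """Viterbi search for the highest-scoring path through the lattice."""
    by_start = defaultdict(list)
    for edge in edges:
        by_start[edge[0]].append(edge)
    # best[pos][cls] = (score, previous (pos, cls) state, edge taken)
    best = {0: {"<s>": (0.0, None, None)}}
    for pos in range(len(text)):
        for prev_cls, (score, _, _) in list(best.get(pos, {}).items()):
            for (s, t, cls) in by_start[pos]:
                cand = score + transition(prev_cls, cls) + emission(text[s:t], cls)
                slot = best.setdefault(t, {})
                if cls not in slot or cand > slot[cls][0]:
                    slot[cls] = (cand, (pos, prev_cls), (s, t, cls))
    # Backtrace from the best class at the end of the string.
    final = best[len(text)]
    pos, cls = len(text), max(final, key=lambda c: final[c][0])
    path = []
    while pos > 0:
        _, prev, (s, t, edge_cls) = best[pos][cls]
        path.append((text[s:t], edge_cls))
        pos, cls = prev
    return path[::-1]


if __name__ == "__main__":
    sentence = "王小明在北京"
    print(decode(sentence, build_lattice(sentence)))
```

With these toy numbers, the best path segments the sentence as 王小明 / 在 / 北京 and labels the first token `PER` and the last `LOC`, so word boundaries and named entities come out of the same search. The person name is not in the lexicon at all; it enters the lattice only through the grammar rule, which is the sense in which the grammars extend the lexicon.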
21 Claims
1. A method of tokenization and named entity recognition of ideographic language, said method including the steps of:
generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon, said finite state grammars are a dynamic and complementary extension of said lexicon for creating named entity hypotheses and said lexicon includes single ideographic characters, words, and predetermined features of said characters and words;
generating segmented text by determining word boundaries in said string of ideographic characters using said word lattice dependent upon a contextual language model and one or more entity language models; and
recognizing one or more named entities in said string of ideographic characters using said word lattice dependent upon said contextual language model and said one or more entity language models.
8. An apparatus for tokenization and named entity recognition of ideographic language, said apparatus including:
means for generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon, said finite state grammars are a dynamic and complementary extension of said lexicon for creating named entity hypotheses and said lexicon includes single ideographic characters, words, and predetermined features of said characters and words;
means for generating segmented text by determining word boundaries in said string of ideographic characters using said word lattice dependent upon a contextual language model and one or more entity language models; and
means for recognizing one or more named entities in said string of ideographic characters using said word lattice dependent upon said contextual language model and said one or more entity language models.
15. A computer program product having a computer readable medium having a computer program recorded therein for tokenization and named entity recognition of ideographic language, said computer program product including:
computer program means for generating a word lattice for a string of ideographic characters using finite state grammars and a system lexicon, said finite state grammars are a dynamic and complementary extension of said lexicon for creating named entity hypotheses and said lexicon includes single ideographic characters, words, and predetermined features of said characters and words;
computer program means for generating segmented text by determining word boundaries in said string of ideographic characters using said word lattice dependent upon a contextual language model and one or more entity language models; and
computer program means for recognizing one or more named entities in said string of ideographic characters using said word lattice dependent upon said contextual language model and said one or more entity language models.
Specification