System and iterative method for lexicon, segmentation and language model joint optimization
First Claim
1. A method comprising:
developing an initial language model from a lexicon and segmentation derived from a received corpus; and
iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved;
wherein:
iteratively refining the language model comprises:
re-segmenting the corpus by determining, for each segment, a probability of occurrence for that segment; and
updating the lexicon from the re-segmented corpus;
updating the lexicon comprises:
identifying a frequency of occurrence for each word of a lexicon in the received corpus; and
deleting the word with the smallest identified frequency from the lexicon; and
the method further comprises re-segmenting the deleted word into two or more smaller words and updating the lexicon with the re-segmented words.
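The lexicon update recited in the claim (count each word's frequency, delete the least frequent word, re-segment it into smaller words, and add those back) might be sketched as follows. This is an illustrative sketch only, not the patented implementation: the names `greedy_split` and `update_lexicon` are hypothetical, the re-segmentation uses a simple greedy longest-match split, and the sketch assumes that only multi-character words are eligible for deletion so that a deleted word can always be split into at least two pieces.

```python
from collections import Counter


def greedy_split(word, lexicon):
    """Split `word` left to right, taking the longest lexicon entry at each step.

    Single characters are always accepted so the split cannot get stuck
    (an assumption of this sketch, not something the claim states).
    """
    max_len = max((len(w) for w in lexicon), default=1)
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + length]
            if length == 1 or candidate in lexicon:
                pieces.append(candidate)
                i += length
                break
    return pieces


def update_lexicon(segmented_corpus, lexicon):
    """Delete the least frequent word and replace it with smaller words.

    `segmented_corpus` is a flat list of segments; `lexicon` is a set of words.
    Returns (new_lexicon, deleted_word, replacement_pieces).
    """
    # Frequency of occurrence of each segment in the segmented corpus.
    counts = Counter(segmented_corpus)
    # Assumption: only multi-character words can be split further.
    candidates = [w for w in lexicon if len(w) > 1]
    if not candidates:
        return set(lexicon), None, []
    # Smallest frequency wins; ties are broken alphabetically for determinism.
    victim = min(candidates, key=lambda w: (counts[w], w))
    reduced = set(lexicon) - {victim}
    # Re-segment the deleted word using the remaining lexicon.
    pieces = greedy_split(victim, reduced)
    return reduced | set(pieces), victim, pieces
```

With the lexicon `{"ab", "abc", "c", "d"}` and the segmented corpus `["ab", "c", "ab", "d", "abc"]`, the multi-character word with the smallest count is `"abc"`, which this sketch removes and re-segments into `"ab"` and `"c"`.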
Abstract
A method for optimizing a language model is presented, comprising developing an initial language model from a lexicon and segmentation derived from a received corpus using a maximum match technique, and iteratively refining the initial language model by dynamically updating the lexicon and re-segmenting the corpus according to statistical principles until a threshold of predictive capability is achieved.
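The abstract names a maximum match technique for the initial segmentation without detailing it here. A common form is forward maximum matching, in which the segmenter takes the longest lexicon entry that starts at the current position. The sketch below illustrates that general idea under stated assumptions: the function name `max_match` is hypothetical, the input is an unsegmented character string, and characters not covered by the lexicon are emitted as single-character segments.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over an unsegmented string.

    At each position the longest lexicon entry is taken; if nothing matches,
    a single character is emitted (an assumption of this sketch).
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                segments.append(candidate)
                i += length
                break
    return segments
```

For example, with the lexicon `{"ab", "abc", "d"}` the string `"abcd"` is segmented as `["abc", "d"]`, because `"abc"` is preferred over the shorter `"ab"`. The resulting segments could then be counted to seed an initial language model, which the iterative refinement would subsequently revise.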
13 Claims
Specification