Classification filter for processing data for creating a language model
Abstract
The method and apparatus use a filter that removes a variety of non-dictated words from data based on probability, thereby improving the effectiveness of creating a language model.
20 Claims
1. A computer-implemented method of processing textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the method comprising, with a computer:
receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
segmenting the textual adaptation data into a sequence of units;
extracting a first set of features for each unit in the sequence;
normalizing the sequence of units to form a normalized sequence of units;
extracting a second set of features for each unit in the normalized sequence of units;
processing the data using a processor operating as a classifier to filter out the non-dictated textual data from the textual adaptation data, thereby identifying at least the textual data suitable for creating the language model, the processing including using a classification model which uses a combination of the first and second sets of features;
outputting the textual data suitable for creating the statistical language model; and
generating the statistical language model from the suitable data, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
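The sequence of steps recited in claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: units are taken to be lines, the two feature sets are simple character and token statistics, and `is_suitable` is a hand-set threshold rule standing in for the trained classification model. All function names are hypothetical.

```python
import re

def segment(text):
    """Split raw adaptation text into units (assumption: one unit per non-empty line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def raw_features(unit):
    """First feature set, computed on the unit before normalization."""
    n = len(unit)
    return {
        "digit_ratio": sum(c.isdigit() for c in unit) / n,
        "symbol_ratio": sum(not (c.isalnum() or c.isspace()) for c in unit) / n,
    }

def normalize(unit):
    """Normalize a unit: lowercase it and keep only alphabetic word tokens."""
    return re.findall(r"[a-z']+", unit.lower())

def normalized_features(tokens):
    """Second feature set, computed on the normalized unit."""
    return {"token_count": len(tokens)}

def is_suitable(f1, f2):
    """Stand-in classifier (assumption): fixed thresholds over both feature sets."""
    return (
        f1["digit_ratio"] < 0.2
        and f1["symbol_ratio"] < 0.1
        and f2["token_count"] >= 3
    )

def filter_adaptation_data(text):
    """Return the normalized units that the stand-in classifier keeps."""
    kept = []
    for unit in segment(text):
        tokens = normalize(unit)
        if is_suitable(raw_features(unit), normalized_features(tokens)):
            kept.append(tokens)
    return kept

SAMPLE = """The patient was seen today for a follow-up visit.
From: jsmith@example.com
> > 2004-03-12 14:05 | ID 88231 | ###
Please schedule the next appointment in two weeks."""

kept = filter_adaptation_data(SAMPLE)
for tokens in kept:
    print(" ".join(tokens))
```

In this toy input the header-like line and the line dominated by digits and symbols are filtered out, while the two sentence-like lines are kept. A real system would replace the threshold rule with a classification model trained on labeled units.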
12. A computer-readable storage medium having computer-executable instructions for performing steps to process textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the instructions comprising:
receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
dividing the textual adaptation data into a sequence of text units;
utilizing a lexicon to extract a first set of features for each text unit in the sequence;
utilizing a task independent language model to extract a second set of features for each text unit in the sequence;
using a classification model which uses a combination of the first and second sets of features to filter out the non-dictated textual data from the textual adaptation data, thereby ascertaining whether each text unit is suitable for creating the statistical language model; and
generating the statistical language model from the suitable text units, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
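The two feature sources named in claim 12, a lexicon and a task independent language model, might be used along the following lines. This is only a sketch under stated assumptions: the claim does not say which features are extracted, so an out-of-vocabulary rate and an average per-token log-probability are chosen here as plausible examples, and the task independent model is approximated by an add-one-smoothed unigram model over a toy background corpus. `LEXICON`, `BACKGROUND`, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy lexicon (assumption): the words a recognizer vocabulary might contain.
LEXICON = {"the", "patient", "was", "seen", "today", "for", "a", "visit"}

# Toy background corpus standing in for a task independent language model.
BACKGROUND = "the patient was seen today the visit was short".split()
COUNTS = Counter(BACKGROUND)
TOTAL = sum(COUNTS.values())
VOCAB = len(COUNTS) + 1  # one extra class reserves probability mass for unseen words

def lexicon_features(tokens):
    """First feature set: share of tokens that are missing from the lexicon."""
    n = max(len(tokens), 1)
    return {"oov_rate": sum(t not in LEXICON for t in tokens) / n}

def lm_features(tokens):
    """Second feature set: average log-probability per token (add-one unigram)."""
    n = max(len(tokens), 1)
    logp = sum(math.log((COUNTS[t] + 1) / (TOTAL + VOCAB)) for t in tokens)
    return {"avg_logprob": logp / n}

def combined_features(tokens):
    """Combination of both feature sets, as input for a classification model."""
    return {**lexicon_features(tokens), **lm_features(tokens)}

dictated = "the patient was seen today".split()
noise = "xq7 zz9 hdr00 the".split()
print(combined_features(dictated))
print(combined_features(noise))
```

Units that resemble dictated text tend to score a low out-of-vocabulary rate and a higher average log-probability than header or table residue, which is what makes such features useful to a classifier.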
20. A computer-readable storage medium having computer-executable instructions for performing steps to process textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the instructions comprising:
receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
dividing the textual adaptation data into a sequence of non-normalized text units;
extracting a first set of features for each text unit in the sequence of non-normalized text units;
normalizing the sequence of non-normalized text units to form a normalized sequence of text units;
extracting a second set of features for each text unit in the normalized sequence of text units utilizing a lexicon;
extracting a third set of features for each text unit in the normalized sequence of text units utilizing a task independent language model;
using a classification model which uses a combination of the first, second, and third sets of features to filter out the non-dictated textual data from the textual adaptation data, thereby ascertaining whether each text unit is suitable for creating the statistical language model; and
generating the statistical language model from the suitable text units, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
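The final step shared by the independent claims, generating a statistical language model that provides prior probability estimates for word sequences, can be sketched as follows. The claims do not fix the model type, so a bigram model with add-one smoothing is assumed here purely for illustration. `train_bigram_lm` and `sequence_logprob` are hypothetical names, and the input is assumed to be the already filtered, tokenized text units.

```python
import math
from collections import Counter

def train_bigram_lm(units):
    """Count histories and bigrams over token lists, with boundary markers."""
    histories, bigrams = Counter(), Counter()
    for tokens in units:
        padded = ["<s>"] + tokens + ["</s>"]
        histories.update(padded[:-1])            # every token that precedes another
        bigrams.update(zip(padded, padded[1:]))  # adjacent token pairs
    # Predictable vocabulary: observed words, end marker, and one unknown class.
    vocab = {w for tokens in units for w in tokens} | {"</s>", "<unk>"}
    return histories, bigrams, len(vocab)

def sequence_logprob(model, tokens):
    """Prior log-probability of a word sequence under the smoothed bigram model."""
    histories, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size))
        for prev, word in zip(padded, padded[1:])
    )

suitable_units = [
    ["the", "patient", "was", "seen"],
    ["the", "patient", "was", "well"],
]
model = train_bigram_lm(suitable_units)

# Two competing hypotheses for the same input; the model prefers the first.
fluent = sequence_logprob(model, ["the", "patient", "was", "seen"])
scrambled = sequence_logprob(model, ["seen", "was", "patient", "the"])
print(fluent, scrambled)
```

In a hypothesis search, scores of this kind would be compared across candidate word sequences so that sequences resembling the adaptation data are preferred.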
Specification