Classification filter for processing data for creating a language model

US 8,165,870 B2
Filed: 02/10/2005
Issued: 04/24/2012
Est. Priority Date: 02/10/2005
Status: Active Grant

Abstract

The method and apparatus utilize a filter to remove a variety of non-dictated words from data based on probability and improve the effectiveness of creating a language model.

20 Claims

1. A computer-implemented method of processing textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the method comprising, with a computer:
- receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
  
  segmenting the textual adaptation data into a sequence of units;
  
  extracting a first set of features for each unit in the sequence;
  
  normalizing the sequence of units to form a normalized sequence of units;
  
  extracting a second set of features for each unit in the normalized sequence of units;
  
  processing the data using a processor operating as a classifier to filter out the non-dictated textual data from the textual adaptation data, thereby identifying at least the textual data suitable for creating the language model, the processing including using a classification model which uses a combination of the first and second sets of features;
  
  outputting the textual data suitable for creating the statistical language model; and
  
  generating the statistical language model from the suitable data, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The computer-implemented method of claim 1 wherein processing includes using a linear classifier based classification model wherein the first set of features is based at least in part on a number of tokens and a number of characters in the sequence of units, and wherein the second set of features is based at least in part on a percentage of tokens requiring normalization in the normalized sequence of units.
  - 3. The computer-implemented method of claim 1 wherein extracting the first and second sets of features includes computing a feature of the classification model that is independent of a vocabulary.
  - 4. The computer-implemented method of claim 3 and further comprising:
    - tokenizing the sequence of units; and
      
      wherein the feature of the classification model comprises at least one of a number of tokens and an average number of characters per token.
  - 5. The computer-implemented method of claim 1 wherein normalizing the sequence of units comprises providing words that would be spoken for portions of the textual adaptation data that do not comprise words, the portions of the textual adaptation data including punctuation marks and numeric elements.
  - 6. The computer-implemented method of claim 1 wherein segmenting comprises:
    - utilizing a word breaker to segment the textual adaptation data into a sequence of units.
  - 7. The computer-implemented method of claim 1 and further comprising:
    - utilizing a lexicon to extract a third set of features for each unit in the normalized sequence of units; and
      
      wherein processing includes processing the data by also using the third set of features in the classification model.
  - 8. The computer-implemented method of claim 7 wherein one of the features in the third set of features includes a percentage of words in the normalized sequence of units that are end-of-sentence words, and wherein a second one of the features in the third set of features includes a percentage of words in the normalized sequence of units that are not in the lexicon.
  - 9. The computer-implemented method of claim 1 and further comprising:
    - utilizing a task independent language model to extract a third set of features for each unit in the normalized sequence of units; and
      
      wherein processing includes processing the data by also using the third set of features in the classification model.
  - 10. The computer-implemented method of claim 9 wherein one of the features in the third set of features includes a perplexity of the normalized sequence of units, wherein a second one of the features in the third set of features includes a percentage of trigrams in the normalized sequence of units that are present in the task independent language model, and wherein a third one of the features in the third set of features includes a percentage of bigrams in the normalized sequence of units that are present in the task independent language model.
  - 11. The computer-implemented method of claim 1 wherein extracting the first and second sets of features includes computing at least some of the features in terms of specific ranges.
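
The front half of claim 1 (segment, extract a first feature set, normalize, extract a second feature set) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the one-unit-per-line segmentation, the regex word breaker, the tiny `SPOKEN` table, and the function names `first_features` and `second_features`. The features chosen follow claims 2, 4, and 5 (token count, average characters per token, share of tokens requiring normalization).

```python
import re

# Hypothetical spoken forms; a production normalizer would cover far more.
SPOKEN = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    ".": "period", ",": "comma", "?": "question mark", ":": "colon",
}
WORD_RE = re.compile(r"[A-Za-z']+")
TOKEN_RE = re.compile(r"[A-Za-z']+|\d|[^\sA-Za-z\d]")


def segment(text):
    """Split adaptation data into units (assumed here: one unit per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tokenize(unit):
    """Toy word breaker: words, single digits, and single symbols."""
    return TOKEN_RE.findall(unit)


def first_features(unit):
    """Vocabulary-independent features of the raw (non-normalized) unit."""
    tokens = tokenize(unit)
    total_chars = sum(len(t) for t in tokens)
    return {
        "num_tokens": len(tokens),
        "avg_chars_per_token": total_chars / len(tokens) if tokens else 0.0,
    }


def normalize(unit):
    """Return (spoken-form words, count of tokens that needed rewriting)."""
    words, rewritten = [], 0
    for tok in tokenize(unit):
        if WORD_RE.fullmatch(tok):
            words.append(tok.lower())
        else:
            rewritten += 1
            # Unknown symbols are kept verbatim in this sketch.
            words.extend(SPOKEN.get(tok, tok).split())
    return words, rewritten


def second_features(unit):
    """Feature that depends on normalization: share of tokens rewritten."""
    tokens = tokenize(unit)
    _, rewritten = normalize(unit)
    return {"pct_needing_normalization": rewritten / len(tokens) if tokens else 0.0}


for unit in segment("Call me at 5.\n>>> From: x@y.com\n"):
    print(unit, first_features(unit), second_features(unit))
```

In this toy input the quoted e-mail header line has a much higher share of tokens needing normalization than the dictated-looking sentence, which is the kind of signal a classifier could use to separate the two.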

12. A computer-readable storage medium having computer-executable instructions for performing steps to process textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the instructions comprising:
- receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
  
  dividing the textual adaptation data into a sequence of text units;
  
  utilizing a lexicon to extract a first set of features for each text unit in the sequence;
  
  utilizing a task independent language model to extract a second set of features for each text unit in the sequence;
  
  using a classification model which uses a combination of the first and second sets of features to filter out the non-dictated textual data from the textual adaptation data, thereby ascertaining whether each text unit is suitable for creating the statistical language model; and
  
  generating the statistical language model from the suitable text units, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
- Dependent claims (13, 14, 15, 16, 17, 18, 19)
  - 13. The computer-readable storage medium of claim 12 wherein using a classification model includes using a conditional maximum entropy classification model, and wherein utilizing the lexicon comprises determining a percentage of words that are end-of-sentence words and determining a percentage of words that are not in the lexicon.
  - 14. The computer-readable storage medium of claim 12 wherein using a classification model comprises using at least one of a Perceptron classifier and a Support Vector Machines classifier, and wherein utilizing the task independent language model comprises determining a perplexity of the sequence of text units, determining a percentage of trigrams that are present in the task independent language model, and determining a percentage of bigrams that are present in the task independent language model.
  - 15. The computer-readable storage medium of claim 12 wherein each text unit comprises a plurality of words, and wherein the method further comprises bucketing the first and second sets of features into ranges, and utilizing the ranges in the classification model.
  - 16. The computer-readable storage medium of claim 15 wherein each text unit comprises a line of text, and wherein the classification model uses a weighted linear combination of the ranges.
  - 17. The computer-readable storage medium of claim 12 and further comprising normalizing the textual adaptation data to provide words that would be spoken for portions of the textual adaptation data that do not comprise words, the portions of the textual adaptation data including punctuation marks and numeric elements.
  - 18. The computer-readable storage medium of claim 17 wherein ascertaining includes computing a feature of the classification model for the text units of the textual adaptation data that is independent of normalization of the textual adaptation data and independent of a vocabulary.
  - 19. The computer-readable storage medium of claim 17 wherein ascertaining includes computing a feature of the classification model for the text units of the textual adaptation data that is based on at least one of normalization of the textual adaptation data and comparison to a vocabulary.
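
The lexicon-based and language-model-based features of claims 13 and 14, together with the range bucketing and weighted linear combination of claims 15 and 16, might look roughly like the following Python sketch. All names and data structures are assumptions for illustration: the lexicon and end-of-sentence words are plain sets, the task independent language model is reduced to a unigram probability table plus sets of known bigrams and trigrams, and perplexity is computed from floored unigram probabilities as a stand-in for a real backoff n-gram model.

```python
import math
from bisect import bisect_right


def ngrams(words, n):
    """All contiguous n-word tuples in the unit."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def fraction(hits, total):
    return hits / total if total else 0.0


def lexicon_features(words, lexicon, eos_words):
    """Share of out-of-lexicon words and of end-of-sentence words."""
    oov = sum(w not in lexicon for w in words)
    eos = sum(w in eos_words for w in words)
    return {"pct_oov": fraction(oov, len(words)),
            "pct_eos": fraction(eos, len(words))}


def lm_features(words, unigram_prob, known_bigrams, known_trigrams, floor=1e-6):
    """Perplexity (unigram stand-in) and n-gram coverage of the unit."""
    bi, tri = ngrams(words, 2), ngrams(words, 3)
    log_sum = sum(math.log(unigram_prob.get(w, floor)) for w in words)
    ppl = math.exp(-log_sum / len(words)) if words else float("inf")
    return {
        "perplexity": ppl,
        "pct_bigrams_known": fraction(sum(g in known_bigrams for g in bi), len(bi)),
        "pct_trigrams_known": fraction(sum(g in known_trigrams for g in tri), len(tri)),
    }


def bucketize(features, edges):
    """Map each real-valued feature to a (name, range index) indicator."""
    return {(name, bisect_right(cuts, features[name])) for name, cuts in edges.items()}


def score(active, weights, bias=0.0):
    """Weighted linear combination over the active range indicators."""
    return bias + sum(weights.get(f, 0.0) for f in active)
```

Bucketing turns each continuous feature into a small set of indicator features, so a linear model can assign a different weight to, say, "no out-of-lexicon words" and "more than half out-of-lexicon words" instead of a single slope. The cut points passed in `edges` are hypothetical and would need to be chosen for the data at hand.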

20. A computer-readable storage medium having computer-executable instructions for performing steps to process textual adaptation data for creating a statistical language model that provides prior probability estimates for word sequences, the instructions comprising:
- receiving textual adaptation data comprising textual data which is suitable for creating the statistical language model and non-dictated textual data which is not suitable for creating the statistical language model;
  
  dividing the textual adaptation data into a sequence of non-normalized text units;
  
  extracting a first set of features for each text unit in the sequence of non-normalized text units;
  
  normalizing the sequence of non-normalized text units to form a normalized sequence of text units;
  
  extracting a second set of features for each text unit in the normalized sequence of text units utilizing a lexicon;
  
  extracting a third set of features for each text unit in the normalized sequence of text units utilizing a task independent language model;
  
  using a classification model which uses a combination of the first, second, and third sets of features to filter out the non-dictated textual data from the textual adaptation data, thereby ascertaining whether each text unit is suitable for creating the statistical language model; and
  
  generating the statistical language model from the suitable text units, wherein the statistical language model provides prior probability estimates for word sequences to guide a hypothesis search for a likely intended word sequence.
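
The last two steps of claim 20 (filter the text units with a classification model, then generate a statistical language model from the units that survive) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the classifier is a plain perceptron (one of the options named in claim 14) over bucketed indicator features, and the language model is an add-one-smoothed bigram model chosen only because it is short. The function names and the `(feature name, range index)` encoding are hypothetical.

```python
from collections import Counter, defaultdict


def train_perceptron(examples, epochs=10):
    """examples: (active_feature_set, label) pairs, label +1 keep, -1 drop."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for active, label in examples:
            s = bias + sum(weights[f] for f in active)
            if label * s <= 0:  # misclassified or on the boundary: update
                for f in active:
                    weights[f] += label
                bias += label
    return weights, bias


def keep_unit(active, weights, bias):
    """True if the unit is classified as suitable for LM training."""
    return bias + sum(weights.get(f, 0.0) for f in active) > 0


def train_bigram_lm(units):
    """Add-one-smoothed P(w2 | w1) estimated from tokenized units."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for words in units:
        padded = ["<s>"] + words + ["</s>"]
        vocab.update(padded[1:])
        for w1, w2 in zip(padded, padded[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    v = len(vocab)

    def prob(w1, w2):
        return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + v)

    return prob


# Hypothetical labeled units: low out-of-lexicon rate -> keep, high -> drop.
labeled = [({("pct_oov", 0)}, +1), ({("pct_oov", 2)}, -1)]
weights, bias = train_perceptron(labeled)

candidates = [
    (["send", "the", "report"], {("pct_oov", 0)}),
    (["x", "y", "com"], {("pct_oov", 2)}),
    (["send", "the", "invoice"], {("pct_oov", 0)}),
]
kept = [words for words, active in candidates if keep_unit(active, weights, bias)]
prob = train_bigram_lm(kept)
print(len(kept), prob("send", "the"))
```

Only the units that pass the classifier contribute counts to the language model, so header-like or otherwise non-dictated text does not distort the probability estimates used later to guide the recognizer's hypothesis search.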

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Acero, Alejandro; Yu, Dong; Odell, Julian J.; Mahajan, Milind V.; Mau, Peter K. L.
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US11/054,819
Publication Number: US 20060178869A1
Time in Patent Office: 2,630 Days
Field of Search: 707/104.1, 707/7, 704/4, 704/49, 704/2, 704/277, 704/235, 704/256, 704/10, 704/241, 704/255, 704/270, 704/270.1, 704/275, 704/9, 706/45, 715/201, 715/205, 715/210
US Class Current: 704/10
CPC Class Codes:
- G06F 40/216   using statistical methods
- G06F 40/284   Lexical analysis, e.g. toke...
- G10L 15/063   Training
- G10L 15/18   using natural language mode...
- G10L 15/183   using context dependencies,...
