Method and system for the automatic amendment of speech recognition vocabularies

US 6,975,985 B2
Filed: 11/26/2001
Issued: 12/13/2005
Est. Priority Date: 11/29/2000
Status: Expired due to Fees

Abstract

The present invention provides a method and system to improve speech recognition using an existing audio realization of a spoken text and a true textual representation of the spoken text. The audio realization and the true textual representation can be aligned to reveal time stamps. A speech recognition can be performed on the audio realization to provide a hypothesis textual representation for the audio realization. The aligned true textual representation can be compared with the hypothesis textual representation. Single word pairs from the true and the hypothesis textual representations can be selected where the representations are different. Similarly, single word pairs can be selected from each representation where the representations are identical. A word or pronunciation database can be updated using the selected single word pairs together with the corresponding aligned audio realization.
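
The selection step in the abstract can be pictured with a minimal sketch. It follows the routing stated in the claims (similar pairs go to the pronunciation database, dissimilar pairs to the word database). The names `WordPair`, `Databases`, and `route_pair` are hypothetical, and case-insensitive equality is only an assumed stand-in for the "identical"/"similar" test, which the record does not define.

```python
from dataclasses import dataclass, field


@dataclass
class WordPair:
    truth: str        # word from the true textual representation
    hypothesis: str   # word from the recognizer's hypothesis
    start: float      # start of the aligned audio segment, in seconds
    end: float        # end of the aligned audio segment, in seconds


@dataclass
class Databases:
    # Plain lists stand in for the real databases (an assumption).
    word_db: list = field(default_factory=list)
    pronunciation_db: list = field(default_factory=list)


def route_pair(pair: WordPair, dbs: Databases) -> str:
    """Store a single-word pair in the database the claims associate with it."""
    # Assumed similarity test: case-insensitive string equality.
    if pair.truth.lower() == pair.hypothesis.lower():
        dbs.pronunciation_db.append(pair)
        return "pronunciation"
    dbs.word_db.append(pair)
    return "word"
```

In this sketch a pair such as ("Stenzel", "stencil") would land in the word database, while ("database", "Database") would add a pronunciation sample for a word the recognizer already knows.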

18 Claims

1. A method of automatically updating a word database and a pronunciation database used by a speech recognition engine to convert speech utterances to text, the method comprising:
- taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
  
  generating a second representation by performing speech recognition on said realization using the word database, said second representation being a time-based transcription of said realization;
  
  expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
  
  processing the first representation to remove all markup language tags;
  
  generating a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said realization, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
  
  detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, further comprising obtaining said first representation by optical character recognition using an optical character recognition device.
  - 3. The method of claim 1, wherein the word database comprises a speaker-dependent database used to adapt the speech recognition to a particular speaker.
  - 4. The method of claim 1, further comprising comparing a recognition quality of said speech recognition of said realization with a recognition quality of a corresponding single-word entry existing in said pronunciation database.
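
The two text-preparation steps of claim 1 (expanding acronyms and abbreviations, and removing markup language tags) might look roughly like the following. This is a sketch under assumptions: the `EXPANSIONS` table, the tag pattern, and the function names are all hypothetical, since the claim names neither a markup language nor any specific expansions.

```python
import re

# Hypothetical expansion table; the claim does not list any entries.
EXPANSIONS = {
    "IBM": "I B M",
    "Dr.": "doctor",
    "e.g.": "for example",
}

# Assumed tag shape: anything between angle brackets, as in HTML or XML.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text: str) -> str:
    """Replace each markup tag with a space so adjacent words stay separated."""
    return TAG_RE.sub(" ", text)


def expand(text: str) -> str:
    """Swap each whitespace-delimited token found in EXPANSIONS for its spoken form."""
    return " ".join(EXPANSIONS.get(token, token) for token in text.split())


def prepare_reference(text: str) -> str:
    """Prepare the first (allegedly true) representation for alignment."""
    return expand(strip_markup(text))
```

The claim lists expansion before tag removal; this sketch strips tags first so that a tag glued to a token (as in `<b>IBM</b>`) does not hide the token from the lookup. The claim applies expansion to both representations, so `expand` alone would be run on the recognizer output as well.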

5. A method of automatically updating a word database and a pronunciation database used by a speech recognition engine to convert speech utterances to text, the method comprising:
- taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
  
  producing a second representation that is a textual representation of said realization by performing a speech recognition on said realization using the word database;
  
  expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
  
  generating a line-by-line output by aligning said first representation and said second representation, each line of said output comprising a segment of said first representation, a segment of said second representation, and a time indicator indicating a start time and end time of said segments;
  
  detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
- Dependent claims (6, 7, 8):
  - 6. The method of claim 5, further comprising obtaining said first representation by optical character recognition using an optical character recognition device.
  - 7. The method of claim 5, wherein the word database comprises a speaker-dependent database used to adapt the speech recognition to a particular speaker.
  - 8. The method of claim 5, further comprising comparing a recognition quality of said speech recognition of said realization with a recognition quality of a corresponding single-word entry existing in said pronunciation database.
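
Claim 5 describes each output line as a segment of the first representation, a segment of the second representation, and a time indicator with start and end times. A minimal sketch of such a structure, together with the one-word marking step, is given below. The alignment itself uses Python's `difflib.SequenceMatcher` purely as an assumed stand-in, since the claim does not name an alignment algorithm; `TimedWord`, `AlignedLine`, and `align` are hypothetical names, and per-word timings on the recognizer output are assumed to be available.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class TimedWord:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class AlignedLine:
    truth: list           # words from the first representation
    hypothesis: list      # words from the second representation
    start: float          # time indicator: segment start
    end: float            # time indicator: segment end
    marked: bool = False  # True when both segments hold exactly one word


def align(truth_words, hyp_words):
    """Build line-by-line output from a word-level diff (assumed technique)."""
    ref = [w.lower() for w in truth_words]
    hyp = [w.text.lower() for w in hyp_words]
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    lines = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if j1 == j2:
            continue  # no recognized words, so no time interval to attach
        if tag == "equal":
            # Split matching runs so that each word gets its own line.
            spans = [(i, i + 1, j, j + 1)
                     for i, j in zip(range(i1, i2), range(j1, j2))]
        else:
            spans = [(i1, i2, j1, j2)]
        for a1, a2, b1, b2 in spans:
            lines.append(AlignedLine(
                truth=truth_words[a1:a2],
                hypothesis=[w.text for w in hyp_words[b1:b2]],
                start=hyp_words[b1].start,
                end=hyp_words[b2 - 1].end,
            ))
    # Detect and mark lines with a one-word segment on each side.
    for line in lines:
        line.marked = len(line.truth) == 1 and len(line.hypothesis) == 1
    return lines
```

Only marked lines would be passed on to the database updates. A line where one true word was recognized as two words, for example, stays unmarked and is ignored in this sketch.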

9. A system for automatically updating a word database and a pronunciation database, the system comprising:
- an audio device for taking a realization of spoken audio;
  
  a text reader for taking a first representation that is an allegedly true textual representation of said realization;
  
  a speech recognizer that performs a speech recognition on said realization to generate a second representation from said realization, said second representation being a time-based transcription of said realization;
  
  a word database used by the speech recognizer to perform speech recognition tasks;
  
  an expander that expands said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
  
  an aligner configured to generate a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said second representation, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
  
  a classifier configured to detect and mark each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation; and
  
  a selector that for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updates said pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio, and for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updates said word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
- Dependent claims (10):
  - 10. The system of claim 9, wherein the text reader comprises an optical character reader.

11. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
  
  generating a second representation by performing speech recognition on said realization using the word database, said second representation being a time-based transcription of said realization;
  
  expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
  
  processing the first representation to remove all markup language tags;
  
  generating a line-by-line output by aligning said first representation and said second representation based on timed intervals derived from the time-based transcription of said second representation, each line matching a segment of said first representation and a corresponding segment of said second representation for a particular one of the timed intervals;
  
  detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating a pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating a word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
- Dependent claims (12, 13, 14):
  - 12. The machine-readable storage of claim 11, further comprising a machine-executable code section to perform the step of obtaining said first representation by optical character recognition using an optical character recognition device.
  - 13. The machine-readable storage of claim 11, wherein the word database comprises a speaker-dependent database used to adapt the speech recognition to a particular speaker.
  - 14. The machine-readable storage of claim 11, further comprising a machine-executable code section to perform the step of comparing a recognition quality of said speech recognition of said realization with a recognition quality of a corresponding single-word entry existing in said pronunciation database.
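
The quality comparison recited in claims 4, 8, 14, and 18 could be sketched as follows. The claims only recite the comparison; what happens afterwards is not stated, so the replace-if-better policy here is an assumption, as are the numeric `quality` score, the dictionary standing in for the pronunciation database, and the names `PronunciationEntry` and `offer`.

```python
from dataclasses import dataclass


@dataclass
class PronunciationEntry:
    word: str
    audio: bytes    # portion of the spoken audio for this word
    quality: float  # hypothetical recognition-quality score, higher is better


def offer(db: dict, candidate: PronunciationEntry) -> bool:
    """Compare the candidate with any existing single-word entry.

    Assumed policy: store the candidate only if no entry exists for the
    word, or if the candidate scores strictly higher than the stored one.
    """
    existing = db.get(candidate.word)
    if existing is not None and existing.quality >= candidate.quality:
        return False
    db[candidate.word] = candidate
    return True
```

Under this assumed policy a new sample replaces a stored one only when it scores strictly higher, so ties keep the existing entry.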

15. A machine-readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- taking a realization of spoken audio and a first representation that is an allegedly true textual representation for said realization;
  
  producing a second representation that is a textual representation of said realization by performing a speech recognition on said realization using the word database;
  
  expanding said first and second representations to convert each acronym and abbreviation contained in said first and second representations to a speech equivalent;
  
  generating a line-by-line output by aligning said first representation and said second representation, each line of said output comprising a segment of said first representation, a segment of said second representation, and a time indicator indicating a start time and end time of said segments;
  
  detecting and marking each line of output that comprises a one-word segment of said first representation and a one-word segment of said second representation;
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are similar, automatically updating a pronunciation database to include said similar one-word segments and a corresponding portion of said spoken audio; and
  
  for each marked line of output whose one-word segment of said first representation and one-word segment of said second representation are dissimilar, automatically updating a word database to include said dissimilar one-word segments and a corresponding portion of said spoken audio.
- Dependent claims (16, 17, 18):
  - 16. The machine-readable storage of claim 15, further comprising a machine-executable code section to perform the step of obtaining said first representation by optical character recognition using an optical character recognition device.
  - 17. The machine-readable storage of claim 15, wherein the word database comprises a speaker-dependent database used to adapt the speech recognition to a particular speaker.
  - 18. The machine-readable storage of claim 15, further comprising a machine-executable code section to perform the step of comparing a recognition quality of said speech recognition of said realization with a recognition quality of a corresponding single-word entry existing in said pronunciation database.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Stenzel, Gerhard; Kriechbaum, Werner
Primary Examiner(s): Young, W. R.
Assistant Examiner(s): Vo, Huyen X.
Application Number: US09/994,396
Publication Number: US 20020065653A1
Time in Patent Office: 1,478 Days
Field of Search: 704/235, 704/254, 704/260, 704/252
US Class Current: 704/231
CPC Class Codes:
G10L 15/06 Creation of reference templ...
G10L 15/187 Phonemic context, e.g. pron...
