Systems and methods for providing online fast speaker adaptation in speech recognition

US 20040172250A1
Filed: 10/16/2003
Published: 09/02/2004
Est. Priority Date: 10/17/2002
Status: Active Grant

Abstract

A system (230) performs speaker adaptation when performing speech recognition. The system (230) receives an audio segment and identifies the audio segment as a first audio segment or a subsequent audio segment associated with a speaker turn. The system (230) then decodes the audio segment to generate a transcription associated with the first audio segment when the audio segment is the first audio segment and estimates a transformation matrix based on the transcription associated with the first audio segment. The system (230) decodes the audio segment using the transformation matrix to generate a transcription associated with the subsequent audio segment when the audio segment is the subsequent audio segment.

28 Claims

1. A method for performing speaker adaptation in a speech recognition system, comprising:
- receiving an audio segment;
  
  determining whether the audio segment is a first audio segment associated with a speaker turn;
  
  decoding the audio segment to generate a transcription associated with the first audio segment when the audio segment is the first audio segment;
  
  estimating a transformation matrix based on the transcription associated with the first audio segment; and
  
  decoding the audio segment using the transformation matrix to generate a transcription associated with a subsequent audio segment when the audio segment is not the first audio segment.
- Dependent claims (2, 3, 4, 5, 7, 8, 9, 10, 11):
- - 2. The method of claim 1, wherein the determining whether the audio segment is a first audio segment includes:
    - receiving information identifying a start of the speaker turn, and identifying the audio segment as the first audio segment based on the information.
  - 3. The method of claim 1, wherein the determining whether the audio segment is a first audio segment includes:
    - identifying a start of the speaker turn.
  - 4. The method of claim 3, further comprising:
    - resetting the transformation matrix upon identifying the start of the speaker turn.
  - 5. The method of claim 1, further comprising:
    - reestimating the transformation matrix based on the transcription associated with the subsequent audio segment to obtain a reestimated transformation matrix.
  - 7. The method of claim 1, further comprising:
    - applying the transformation matrix to one or more acoustic models.
  - 8. The method of claim 7, wherein the decoding the audio segment using the transformation matrix includes:
    - using the one or more acoustic models to generate the transcription associated with the subsequent audio segment.
  - 9. The method of claim 1, wherein the estimating a transformation matrix includes:
    - constructing a matrix using features associated with straight cepstrals corresponding to the audio segment, and replicating the matrix to generate the transformation matrix.
  - 10. The method of claim 1, wherein the estimating a transformation matrix includes:
    - using a statistical alignment technique to obtain values for the transformation matrix.
  - 11. The method of claim 10, wherein the statistical alignment technique is a Viterbi alignment technique.
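
Claim 9 does not say how the matrix is "replicated". One plausible reading, offered here only as an illustration, is that a small transform estimated on the straight cepstral features is reused for each derived feature stream, giving a block-diagonal matrix. The sketch below assumes that reading; the function name `replicate_block_diagonal`, the 2x2 block values, and the choice of three streams are all hypothetical and do not come from the claims.

```python
def replicate_block_diagonal(block, copies):
    """Place `copies` copies of a square `block` along the diagonal of a larger matrix."""
    n = len(block)
    size = n * copies
    full = [[0.0] * size for _ in range(size)]
    for c in range(copies):
        offset = c * n
        for i in range(n):
            for j in range(n):
                full[offset + i][offset + j] = block[i][j]
    return full


# Hypothetical 2x2 transform estimated on the straight cepstral features.
cepstral_block = [[1.0, 0.1],
                  [0.0, 0.9]]

# Assumption: three feature streams (static, first and second derivatives).
transform = replicate_block_diagonal(cepstral_block, copies=3)
```

Under this assumption only the small block has to be estimated from the first audio segment, which may be one reason the claim separates construction from replication.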

6. The method of claim 5, further comprising:
- receiving another audio segment associated with the speaker turn; and
  
  decoding the other audio segment using the reestimated transformation matrix.
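
The control flow described by claims 1 and 4 to 6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `Segment`, `adapt_online`, and the `decode` and `estimate` callables are hypothetical names, and the two stub functions at the bottom only stand in for a real recognizer and a real transform estimator.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    audio: str          # stand-in for the audio samples of one segment
    starts_turn: bool   # True when this segment begins a speaker turn


def adapt_online(segments, decode, estimate):
    """Decode segments in order, adapting to the current speaker as it goes."""
    transform = None
    transcripts = []
    for seg in segments:
        if seg.starts_turn:
            transform = None                      # claim 4: reset at a new turn
        text = decode(seg.audio, transform)       # first segment: no transform yet
        transcripts.append(text)
        # Claim 1 estimates from the first segment; claim 5 re-estimates later.
        transform = estimate(seg.audio, text, transform)
    return transcripts


# Toy stand-ins (assumptions): the "transform" is just an update counter.
def toy_decode(audio, transform):
    return f"{audio}|{'SI' if transform is None else f'T{transform}'}"


def toy_estimate(audio, text, previous):
    return 1 if previous is None else previous + 1


segments = [Segment("a1", True), Segment("a2", False), Segment("a3", False),
            Segment("b1", True), Segment("b2", False)]
results = adapt_online(segments, toy_decode, toy_estimate)
```

In this toy run the first segment of each turn is decoded without a transform (`SI`), and each later segment in the same turn uses the most recently re-estimated one.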

12. A system for performing speaker adaptation when performing speech recognition, comprising:
- means for receiving an audio segment;
  
  means for identifying the audio segment as a first audio segment or a subsequent audio segment associated with a speaker turn;
  
  means for decoding the audio segment to generate a transcription associated with the first audio segment when the audio segment is the first audio segment;
  
  means for estimating a transformation matrix based on the transcription associated with the first audio segment; and
  
  means for decoding the audio segment using the transformation matrix to generate a transcription associated with the subsequent audio segment when the audio segment is the subsequent audio segment.

13. A decoder within a speech recognition system, comprising:
- a forward decoding stage;
  
  a backward decoding stage; and
  
  a rescoring stage;
  
  at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage being configured to:
  
  receive an audio segment, identify the audio segment as a first audio segment or a subsequent audio segment associated with a speaker turn, decode the audio segment to generate a transcription associated with the first audio segment when the audio segment is the first audio segment, estimate a transformation matrix based on the transcription associated with the first audio segment, and decode the audio segment using the transformation matrix to generate a transcription associated with the subsequent audio segment when the audio segment is the subsequent audio segment.
- Dependent claims (14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27):
- - 14. The decoder of claim 13, wherein when identifying the audio segment, the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is configured to:
    - receive information identifying a start of the speaker turn, and identify the audio segment as the first audio segment when the information is received.
  - 15. The decoder of claim 13, wherein when identifying the audio segment, the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is configured to:
    - identify a start of the speaker turn.
  - 16. The decoder of claim 15, wherein the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is further configured to:
    - reset the transformation matrix upon identifying the start of the speaker turn.
  - 17. The decoder of claim 13, wherein the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is further configured to:
    - reestimate the transformation matrix based on the transcription associated with the subsequent audio segment to obtain a reestimated transformation matrix.
  - 19. The decoder of claim 13, wherein the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is further configured to:
    - apply the transformation matrix to one or more acoustic models.
  - 20. The decoder of claim 19, wherein when decoding the audio segment using the transformation matrix, the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is configured to:
    - use the one or more acoustic models to generate the transcription associated with the subsequent audio segment.
  - 21. The decoder of claim 13, wherein when estimating a transformation matrix, the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is configured to:
    - construct a matrix using features associated with straight cepstrals corresponding to the audio segment, and replicate the matrix to generate the transformation matrix.
  - 22. The decoder of claim 13, wherein when estimating a transformation matrix, the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is configured to:
    - use a statistical alignment technique to obtain values for the transformation matrix.
  - 23. The decoder of claim 22, wherein the statistical alignment technique is a Viterbi alignment technique.
  - 24. The decoder of claim 13, wherein the backward decoding stage is configured to use transcriptions from the forward decoding stage when estimating the transformation matrix.
  - 25. The decoder of claim 24, wherein the backward decoding stage is configured to wait until transcriptions corresponding to the entire speaker turn are received before estimating the transformation matrix.
  - 26. The decoder of claim 13, wherein the rescoring stage is configured to use transcriptions from at least one of the forward decoding stage and the backward decoding stage when estimating the transformation matrix.
  - 27. The decoder of claim 26, wherein the rescoring stage is configured to wait until transcriptions corresponding to the entire speaker turn are received before estimating the transformation matrix.

18. The decoder of claim 17, wherein the at least one of the forward decoding stage, the backward decoding stage, and the rescoring stage is further configured to:
- receive another audio segment associated with the speaker turn, and decode the other audio segment using the reestimated transformation matrix.
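
Claims 24 and 25 describe a backward decoding stage that uses the forward stage's transcriptions and waits for the whole speaker turn before estimating the transformation matrix. A minimal sketch of that buffering behaviour follows. The class name `BackwardStage`, its `push` method, and the `end_of_turn` flag are hypothetical, and decoding every buffered segment once the estimate is ready is an assumption about what happens next, since the claims do not spell it out.

```python
class BackwardStage:
    """Buffers forward-pass output for one speaker turn before adapting."""

    def __init__(self, estimate, decode):
        self._estimate = estimate   # (list of (audio, text)) -> transform
        self._decode = decode       # (audio, transform) -> text
        self._buffer = []

    def push(self, audio, forward_text, end_of_turn):
        """Add one segment; return adapted transcripts once the turn is complete."""
        self._buffer.append((audio, forward_text))
        if not end_of_turn:
            return []               # claim 25: keep waiting for the full turn
        transform = self._estimate(self._buffer)
        adapted = [self._decode(a, transform) for a, _ in self._buffer]
        self._buffer = []           # start fresh for the next speaker turn
        return adapted


# Toy stand-ins (assumptions): the "transform" is the number of buffered segments.
stage = BackwardStage(estimate=lambda pairs: len(pairs),
                      decode=lambda audio, t: f"{audio}@{t}")

first = stage.push("a1", "hello", end_of_turn=False)
second = stage.push("a2", "world", end_of_turn=True)
third = stage.push("b1", "hi", end_of_turn=True)
```

The trade-off in this arrangement is latency: nothing is emitted until the turn ends, but the estimate sees every transcription in the turn.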

28. A speech recognition system, comprising:
- speaker change detection logic configured to:
  
  receive a plurality of audio segments, and identify boundaries between speakers associated with the audio segments as speaker turns; and
  
  a decoder configured to:
  
  receive, from the speaker change detection logic, one of the audio segments as a received audio segment associated with one of the speaker turns, identify the received audio segment as a first audio segment or a subsequent audio segment associated with the speaker turn, decode the received audio segment to generate a transcription associated with the first audio segment when the received audio segment is the first audio segment, construct a transformation matrix based on the transcription associated with the first audio segment, and decode the received audio segment using the transformation matrix to generate a transcription associated with the subsequent audio segment when the received audio segment is the subsequent audio segment.
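
The hand-off in claim 28 between the speaker change detection logic and the decoder can be illustrated with a small sketch. It assumes, purely for illustration, that each segment already carries a speaker label; a real detector would work from the audio itself, and the claims do not say how boundaries are found. The function name `mark_speaker_turns` and the labels are hypothetical.

```python
def mark_speaker_turns(labelled_segments):
    """Flag each segment as the first of a speaker turn (True) or a subsequent one."""
    marked = []
    previous = object()   # sentinel that never equals a real speaker label
    for audio, speaker in labelled_segments:
        marked.append((audio, speaker != previous))
        previous = speaker
    return marked


# Hypothetical input: (segment id, speaker label) pairs in time order.
stream = [("s1", "A"), ("s2", "A"), ("s3", "B"), ("s4", "A")]
turns = mark_speaker_turns(stream)
```

The decoder would then treat each segment flagged `True` as a first audio segment and the others as subsequent audio segments of the same turn.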

Current Assignee: Ramp Holdings Incorporated (Clean Harbors Incorporated)
Original Assignee: BBN Technologies Corporation (Rtx Corporation)
Inventors: Liu, Daben
Granted Patent: US 7,292,977 B2
US Class Current: 704/260
CPC Class Codes:
G10L 15/28 Constructional details of s...
G10L 15/32 Multiple recognisers used i...
