Method and apparatus for the automatic separating and indexing of multi-speaker conversations

US 7,496,510 B2
Filed: 11/30/2001
Issued: 02/24/2009
Est. Priority Date: 11/30/2000
Status: Expired due to Term

Abstract

Disclosed are a method and apparatus for processing a continuous audio stream containing human speech in order to locate a particular speech-based transaction in the audio stream, applying both known speaker recognition and speech recognition techniques. Only the utterances of a particular predetermined speaker are transcribed, thus providing an index and a summary of the underlying dialogue(s). In a first scenario, an incoming audio stream, e.g. a speech call from outside, is scanned in order to detect audio segments of the predetermined speaker. These audio segments are then indexed and only the indexed segments are transcribed into spoken or written language. In a second scenario, two or more speakers located in one room are using a multi-user speech recognition system (SRS). For each user there exists a different speaker model and optionally a different dictionary or vocabulary of words already known or trained by the speech or voice recognition system.
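
The first scenario in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Segment` type and the `changed`, `identify`, and `transcribe` callables are hypothetical stand-ins for a change detector, a speaker recognizer, and a speech recognizer, none of which the record specifies. The sketch only shows the control flow: recognition runs when a speaker change is detected, and transcription runs only for the predetermined speaker.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Segment:
    start: float              # segment start time in seconds
    end: float                # segment end time in seconds
    samples: Sequence[float]  # digitized audio for this segment


def process_stream(
    segments: Sequence[Segment],
    changed: Callable[[Segment, Segment], bool],  # hypothetical change detector
    identify: Callable[[Segment], str],           # hypothetical speaker recognizer
    transcribe: Callable[[Segment], str],         # hypothetical speech recognizer
    target: str,                                  # the predetermined speaker
) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) entries for the target speaker only."""
    index: List[Tuple[float, float, str]] = []
    current: Optional[str] = None
    prev: Optional[Segment] = None
    for seg in segments:
        # Speaker recognition is performed only when a change is detected
        # (the first segment is treated as a change from "no speaker").
        if prev is None or changed(prev, seg):
            current = identify(seg)
        # Transcribe only if the recognized speaker is the predetermined one.
        if current == target:
            index.append((seg.start, seg.end, transcribe(seg)))
        prev = seg
    return index
```

The returned list doubles as an index into the stream, since each transcribed passage carries its start and end time.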

30 Claims

1. A method of processing a continuous audio stream containing human speech from a plurality of speakers related to at least one particular transaction, comprising the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected;
  
  determining whether a recognized speaker is a predetermined speaker; and
  
  transcribing at least part of the continuous audio stream only if the recognized speaker is the predetermined speaker;
  
  wherein said transcribing is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

2. A method according to claim 1, comprising a further step of protocolling time information for detected speaker changes.

3. A method according to claim 1, wherein the step of detecting a speaker change and/or the step of performing a speaker recognition is/are preceded by a further step of detecting non-speech boundaries between continuous speech segments.

4. A method according to claim 1, wherein the step of detecting a speaker change is accomplished by use of at least one characteristic audio feature, in particular features derived from the spectrum of the audio signal.

5. A method according to claim 1, wherein the step of performing a speaker recognition involves the particular steps of calculating a speaker signature from the audio stream and comparing the calculated speaker signature with at least one known speaker signature.

6. A method according to claim 1 for use in a speech recognition or voice control system comprising at least two speaker-specific speaker models and/or dictionaries, wherein interchanging between the at least two speaker-specific dictionaries is dependent on the detected speaker change and the corresponding recognized speaker.
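
The signature comparison described in claim 5 might look like the following sketch. The claims do not say how a signature is calculated or compared; averaging per-frame feature vectors and scoring by cosine similarity against a threshold are assumptions chosen for illustration, and the names `mean_signature`, `cosine`, and `recognize` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def mean_signature(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one signature (assumed scheme)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recognize(
    signature: Sequence[float],
    known: Dict[str, Sequence[float]],  # stored signatures keyed by speaker name
    threshold: float = 0.9,             # assumed acceptance threshold
) -> Optional[str]:
    """Return the best-matching known speaker, or None if no score reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in known.items():
        score = cosine(signature, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` for an unmatched signature lets a caller treat unknown speakers the same way as non-target speakers, that is, by skipping transcription.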

7. A method of processing a continuous audio stream containing human speech of a plurality of speakers related to at least one particular transaction, comprising the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected;
  
  determining whether a recognized speaker is a predetermined speaker;
  
  indexing the audio stream with respect to the detected speaker change only if the recognized speaker is the predetermined speaker;
  
  wherein said indexing is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

8. A method according to claim 7, comprising a further step of protocolling time information for detected speaker changes.

9. A method according to claim 7, wherein the step of detecting a speaker change and/or the step of performing a speaker recognition is/are preceded by a further step of detecting non-speech boundaries between continuous speech segments.

10. A method according to claim 7, wherein the step of detecting a speaker change is accomplished by use of at least one characteristic audio feature, in particular features derived from the spectrum of the audio signal.

11. A method according to claim 7, wherein the step of performing a speaker recognition involves the particular steps of calculating a speaker signature from the audio stream and comparing the calculated speaker signature with at least one known speaker signature.

12. A method according to claim 7 for use in a speech recognition or voice control system comprising at least two speaker-specific speaker models and/or dictionaries, wherein interchanging between the at least two speaker-specific dictionaries is dependent on the detected speaker change and the corresponding recognized speaker.
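
Indexing with respect to detected speaker changes (claim 7), together with the logged time information of claim 8, could be sketched as follows. The input format, a time-sorted list of `(time, speaker)` pairs with one entry per detected change, and the names `IndexEntry` and `build_index` are assumptions for illustration, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class IndexEntry:
    start: float   # time of the speaker change that opens the passage, in seconds
    end: float     # time of the next change, or the end of the stream
    speaker: str


def build_index(
    changes: Sequence[Tuple[float, str]],  # (time, recognized speaker), sorted by time
    stream_end: float,
    target: str,                           # the predetermined speaker
) -> Tuple[List[float], List[IndexEntry]]:
    """Log every change time, but index only the target speaker's passages."""
    change_log = [time for time, _ in changes]
    index: List[IndexEntry] = []
    for i, (start, speaker) in enumerate(changes):
        end = changes[i + 1][0] if i + 1 < len(changes) else stream_end
        if speaker == target:
            index.append(IndexEntry(start, end, speaker))
    return change_log, index
```

Keeping the change log separate from the index reflects the difference between the two claims: every detected change is time-stamped, while only the predetermined speaker's passages are indexed.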

13. An apparatus for processing a continuous audio stream containing human speech from a plurality of speakers related to at least one particular transaction, comprising:
- a digitizer which digitizes the continuous audio stream;
  
  a detector which detects speaker changes in the digitized audio stream;
  
  a recognizer which recognizes the predetermined speaker in the audio stream;
  
  a determiner which determines whether a recognized speaker is a predetermined speaker; and
  
  an initiator which initiates transcription of at least part of the continuous audio stream only if the recognized speaker is the predetermined known speaker;
  
  wherein said transcription is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

14. An apparatus according to claim 13, further comprising a detector which detects non-speech boundaries between continuous speech segments.

15. An apparatus according to claim 13, further comprising a scanner which automatically scans a continuous audio record, in particular a continuous audio stream recorded on a data or a signal carrier, and for detecting speaker changes in the continuous audio record.

16. An apparatus according to claim 13, further comprising a monitor which continuously monitors a real-time continuous audio stream and performing the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected; and
  
  transcribing at least part of the continuous audio stream if a predetermined speaker is recognized.

17. An apparatus according to claim 13, further comprising a monitor which continuously monitors a real-time continuous audio stream and performing the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected; and
  
  indexing the audio stream with respect to the detected speaker change if a predetermined speaker is recognized.

18. An apparatus according to claim 13, further comprising a logging device which protocols time information for the at least one detected speaker change.

19. An apparatus according to claim 13, comprising a marking device which marks at least the beginning of a detected speech segment related to a predetermined speaker.

20. An apparatus according to claim 13, comprising data base which stores speech signatures for at least two speakers.
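
A detector for non-speech boundaries, as recited in claim 14, is often built on short-time energy. The claims do not name a technique, so the fixed-length framing and energy threshold below are assumptions, and `find_speech_segments` is a hypothetical name. The sketch returns the sample ranges of continuous speech; the gaps between them are the non-speech boundaries.

```python
from typing import List, Sequence, Tuple


def find_speech_segments(
    samples: Sequence[float],  # digitized audio
    frame_len: int,            # samples per analysis frame (assumed fixed)
    threshold: float,          # mean-energy level separating speech from non-speech
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of runs of frames above the threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i                 # a speech run begins at this frame
        elif start is not None:
            segments.append((start, i))   # a non-speech frame closes the run
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Running speaker-change detection only inside the returned ranges avoids comparing speech against silence, which is one reason such a boundary detector would come first.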

21. An apparatus for processing a continuous audio stream containing human speech from a plurality of speakers related to at least one particular transaction, comprising:
- a detector which detects speaker changes in the audio stream;
  
  a digitizer which digitizes the continuous audio stream;
  
  a recognizer which recognizes the predetermined speaker in the digitized audio stream;
  
  a determiner which determines whether a recognized speaker is a predetermined speaker; and
  
  an indexer for indexing at least part of the continuous audio stream only if the recognized speaker is the predetermined known speaker;
  
  wherein said indexing is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

22. An apparatus according to claim 21, further comprising a detector which detects non-speech boundaries between continuous speech segments.

23. An apparatus according to claim 21, further comprising a scanner which automatically scans a continuous audio record, in particular a continuous audio stream recorded on a data or a signal carrier, and for detecting speaker changes in the continuous audio record.

24. An apparatus according to claim 21, further comprising a monitor which continuously monitors a real-time continuous audio stream and performing the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected; and
  
  transcribing at least part of the continuous audio stream if a predetermined speaker is recognized.

25. An apparatus according to claim 21, further comprising a monitor which continuously monitors a real-time continuous audio stream and performing the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected; and
  
  indexing the audio stream with respect to the detected speaker change if a predetermined speaker is recognized.

26. An apparatus according to claim 21, further comprising a logging device which protocols time information for the at least one detected speaker change.

27. An apparatus according to claim 21, comprising a marking device which marks at least the beginning of a detected speech segment related to a predetermined speaker.

28. An apparatus according to claim 21, comprising data base which stores speech signatures for at least two speakers.

29. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for processing a continuous audio stream containing human speech from a plurality of speakers related to at least one particular transaction, said method comprising the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected;
  
  determining whether a recognized speaker is a predetermined speaker; and
  
  transcribing at least part of the continuous audio stream only if the recognized speaker is the predetermined speaker;
  
  wherein said transcribing is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

30. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for processing a continuous audio stream containing human speech from a plurality of speakers related to at least one particular transaction, said method comprising the steps of:
- digitizing the continuous audio stream;
  
  detecting a speaker change in the digitized audio stream;
  
  performing a speaker recognition if a speaker change is detected;
  
  determining whether a recognized speaker is a predetermined speaker;
  
  indexing the audio stream with respect to the detected speaker change only if the recognized speaker is the predetermined speaker;
  
  wherein said indexing is processed using a dictionary of speaker-trained data trained by the speaker being transcribed.

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Stenzel, Gerhard; Kriechbaum, Werner; Frank, Joachim
Primary Examiner(s): Vo; Huyen X.
Application Number: US09/997,957
Publication Number: US 20020091517A1
Time in Patent Office: 2,643 Days
Field of Search: 704/246, 704/260, 704/273, 704/238, 704/214, 704/208, 704/233, 704/270.1, 704/231, 704/270, 704/247, 704/249, 704/252, 704/235, 704/239, 704/245
US Class Current: 704/246
CPC Class Codes:
G10L 17/00 Speaker identification or v...
G10L 21/028 using properties of sound s...
