Automatic collection of speaker name pronunciations

US 9,240,181 B2
Filed: 08/20/2013
Issued: 01/19/2016
Est. Priority Date: 08/20/2013
Status: Active Grant

Abstract

An audio stream is segmented into a plurality of time segments using speaker segmentation and recognition (SSR), with each time segment corresponding to the speaker's name, producing an SSR transcript. The audio stream is transcribed into a plurality of word regions using automatic speech recognition (ASR), with each of the word regions having a measure of the confidence in the accuracy of the translation, producing an ASR transcript. Word regions with a relatively low confidence in the accuracy of the translation are identified. The low confidence regions are filtered using named entity recognition (NER) rules to identify low confidence regions that are likely names. The NER rules associate a region that is identified as a likely name with the name of the speaker corresponding to the current, the previous, or the next time segment. All of the likely name regions associated with that speaker's name are selected.
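
The filtering and selection steps in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `WordRegion` type, the `low_confidence_name_regions` helper, the 0.5 threshold, and the speaker names. The NER step is assumed to have already produced a mapping from likely name regions to SSR name labels.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordRegion:
    start: float       # seconds from the start of the audio stream
    end: float
    text: str          # ASR hypothesis for this region
    confidence: float  # ASR accuracy confidence in [0, 1]


def low_confidence_name_regions(word_regions, likely_names, threshold=0.5):
    """Group low confidence likely-name regions by name label.

    `likely_names` maps a WordRegion to the SSR name label that the NER
    rules associated with it. The 0.5 threshold is an assumed value.
    """
    low = {r for r in word_regions if r.confidence < threshold}
    grouped = defaultdict(list)
    for region, label in likely_names.items():
        if region in low:  # keep only likely names that are also low confidence
            grouped[label].append(region)
    return dict(grouped)


# Hypothetical ASR output; two regions are misrecognized speaker names.
regions = [
    WordRegion(0.0, 0.4, "thanks", 0.93),
    WordRegion(0.4, 0.9, "day not", 0.21),
    WordRegion(5.0, 5.5, "raj", 0.88),
    WordRegion(9.1, 9.6, "dinner", 0.30),
]
names = {
    regions[1]: "Dana Ito",
    regions[2]: "Raj Mehta",
    regions[3]: "Dana Ito",
}
selected = low_confidence_name_regions(regions, names)
```

In this sketch `selected` holds the two low confidence regions for "Dana Ito". The confidently recognized region for "Raj Mehta" is dropped, since a name the recognizer already gets right needs no new pronunciation.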

20 Claims

1. A method comprising:
- segmenting an audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
  
  transcribing the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
  
  identifying a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
  
  identifying at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or subsequent time segment in the SSR transcript;
  
  filtering the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
  
  selecting all of the low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
  
  decoding a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
  
  correlating the selected name label with all of the phoneme transcripts for the selected likely name regions.
- - 2. The method of claim 1, further comprising:
    - creating a phoneme string of the selected name label using a grapheme-to-phoneme (G2P) tool;
      
      matching at least a portion of the phoneme string of the selected name label with at least a portion of the phoneme transcript for each of the selected likely name regions.
  - 3. The method of claim 1, further comprising:
    - creating a phoneme string of the selected name label with a grapheme-to-phoneme (G2P) tool;
      
      adjusting a length of at least one of the selected likely name regions to match the length of the phoneme string.
  - 4. The method of claim 1, further comprising updating a name pronunciation database, the name pronunciation database correlating a plurality of names with a plurality of phoneme transcripts.
  - 5. The method of claim 1, wherein the name label comprises at least one of a full name, a salutation and last name, or a nickname.
  - 6. The method of claim 1, wherein the automatic speech recognition (ASR) transcript comprises a plurality of time stamps associated with the plurality of word regions.
  - 7. The method of claim 1, wherein identifying at least one likely name region from the ASR transcript comprises using NER rules to analyze word regions only in proximity to each of the plurality of low confidence regions.
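
One way to picture the matching in claim 2 and the database update in claim 4 is the Python sketch below. It is an illustration under stated assumptions, not the claimed implementation. `TOY_G2P` is a hypothetical lookup table standing in for a real G2P tool, the phoneme symbols are ARPAbet-style and chosen only for the example, and the sliding-window comparison with `difflib.SequenceMatcher` and the 0.6 acceptance ratio are arbitrary choices.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a G2P tool: name label -> phoneme string.
TOY_G2P = {"Dana Ito": ["D", "EY", "N", "AH", "IY", "T", "OW"]}


def best_partial_match(reference, decoded):
    """Return (start, end, ratio) for the window of `decoded` that is
    most similar to `reference`, using windows of len(reference)."""
    n = len(reference)
    best = (0, min(n, len(decoded)), 0.0)
    for start in range(max(1, len(decoded) - n + 1)):
        window = decoded[start:start + n]
        ratio = SequenceMatcher(None, reference, window, autojunk=False).ratio()
        if ratio > best[2]:
            best = (start, start + len(window), ratio)
    return best


def update_pronunciations(database, label, transcripts, min_ratio=0.6):
    """Store the best-matching slice of each decoded phoneme transcript
    under `label` when it is similar enough to the G2P phoneme string."""
    reference = TOY_G2P[label]
    for decoded in transcripts:
        start, end, ratio = best_partial_match(reference, decoded)
        if ratio >= min_ratio:
            database.setdefault(label, []).append(decoded[start:end])
    return database


# Hypothetical phoneme decoder output for two selected regions.
spoken = ["TH", "AE", "NG", "K", "S", "D", "EY", "N", "AA", "IY", "T", "OW"]
noise = ["HH", "AH", "L", "OW"]
pronunciations = update_pronunciations({}, "Dana Ito", [spoken, noise])
```

Trimming each transcript to the best window loosely mirrors the length adjustment in claim 3. The stored entry keeps the decoded vowel `AA` rather than the G2P guess `AH`, so the database records how the name was actually spoken.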

8. One or more non-transitory computer readable storage media encoded with software comprising computer executable instructions and when the software is executed operable to cause a processor to:
- segment an audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
  
  transcribe the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
  
  identify a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
  
  identify at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or subsequent time segment;
  
  filter the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
  
  select all of the likely low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
  
  create a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
  
  correlate the selected name label with all of the phoneme transcripts for the selected likely name regions.
- - 9. The computer readable storage media of claim 8, further comprising computer executable instructions operable to cause the processor to:
    - create a phoneme string of the selected name label using a grapheme-to-phoneme (G2P) tool;
      
      match at least a portion of the phoneme string of the selected name label with at least a portion of the phoneme transcript for each of the selected likely name regions.
  - 10. The computer readable storage media of claim 8, further comprising computer executable instructions operable to cause the processor to:
    - create a phoneme string of the selected name label with a grapheme-to-phoneme (G2P) tool;
      
      adjust a length of at least one of the selected likely name regions to match the length of the phoneme string.
  - 11. The computer readable storage media of claim 8, further comprising computer executable instructions operable to cause the processor to update a name pronunciation database, the name pronunciation database correlating a plurality of names with a plurality of phoneme transcripts.
  - 12. The computer readable storage media of claim 8, wherein the name label comprises at least one of a full name, a salutation and last name, or a nickname.
  - 13. The computer readable storage media of claim 8, wherein the automatic speech recognition (ASR) transcript comprises a plurality of time stamps associated with the plurality of word regions.
  - 14. The computer readable storage media of claim 8, wherein the computer executable instructions operable to cause the processor to identify at least one likely name region from the ASR transcript comprises computer executable instructions operable to cause the processor to use NER rules to analyze word regions only in proximity to each of the plurality of low confidence regions.
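
The association between a likely name region and the current, previous, or subsequent SSR time segment relies on time stamps like those in claim 13. A rough Python sketch follows. The `CUE_RULES` table is a hypothetical example of NER rules, since the claims do not spell the rules out, and the segment layout, helper names, and speaker names are likewise assumptions.

```python
from bisect import bisect_right

# Hypothetical NER cue rules: the phrase spoken just before a likely name
# region decides which SSR segment's name label the region refers to.
CUE_RULES = {"thanks": "previous", "this is": "current", "over to": "subsequent"}
OFFSETS = {"previous": -1, "current": 0, "subsequent": 1}


def segment_index(ssr_segments, time):
    """Index of the SSR segment containing `time`.

    `ssr_segments` is a list of (start_time, name_label) sorted by start.
    """
    starts = [start for start, _ in ssr_segments]
    return max(0, bisect_right(starts, time) - 1)


def label_for_region(ssr_segments, region_start, relation):
    """Name label of the current, previous, or subsequent segment,
    or None when that segment does not exist."""
    i = segment_index(ssr_segments, region_start) + OFFSETS[relation]
    if 0 <= i < len(ssr_segments):
        return ssr_segments[i][1]
    return None


def associate(ssr_segments, region_start, cue):
    """Apply the cue rules to pick a name label for a likely name region."""
    relation = CUE_RULES.get(cue.lower())
    if relation is None:
        return None  # no rule fired, so the region is not treated as a name
    return label_for_region(ssr_segments, region_start, relation)


# Hypothetical SSR transcript: three time segments, two speakers.
ssr = [(0.0, "Dana Ito"), (12.5, "Raj Mehta"), (30.0, "Dana Ito")]
label = associate(ssr, 13.0, "Thanks")
```

Here a region at 13.0 seconds preceded by "Thanks" falls in the second segment, so the rule points at the previous segment's label, "Dana Ito".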

15. An apparatus comprising:
- an input/output interface configured to receive an audio stream;
  
  a processor coupled to the input/output interface and configured to:
  
  segment the audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
  
  transcribe the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
  
  identify a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
  
  identify at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or next time segment;
  
  filter the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
  
  select all of the likely low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
  
  create a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
  
  correlate the selected name label with all of the phoneme transcripts for the selected likely name regions.
- - 16. The apparatus of claim 15, wherein the processor is further configured to:
    - create a phoneme string of the selected name label using a grapheme-to-phoneme (G2P) tool;
      
      match at least a portion of the phoneme string of the selected name label with at least a portion of the phoneme transcript for each of the selected likely name regions.
  - 17. The apparatus of claim 15, wherein the processor is further configured to:
    - create a phoneme string of the selected name label with a grapheme-to-phoneme (G2P) tool;
      
      adjust a length of at least one of the selected likely name regions to match the length of the phoneme string.
  - 18. The apparatus of claim 15, further comprising a network interface configured to update a name pronunciation database, the name pronunciation database correlating a plurality of names with a plurality of phoneme transcripts.
  - 19. The apparatus of claim 15, wherein the name label comprises at least one of a full name, a salutation and last name, or a nickname.
  - 20. The apparatus of claim 15, wherein the automatic speech recognition (ASR) transcript comprises a plurality of time stamps associated with the plurality of word regions.

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Inventors: Khare, Aparna; Agrawal, Neha; Kajarekar, Sachin S.; Paulik, Matthias
Primary Examiner(s): ROBERTS, SHAUN A
Application Number: US13/970,850
Publication Number: US 20150058005A1
Time in Patent Office: 882 Days
Field of Search: 704/235, 704/243, 704/244, 704/246, 704/9
US Class Current: 1/1
CPC Class Codes:
G10L 15/04   Segmentation; Word boundary...
G10L 15/063   Training
G10L 15/187   Phonemic context, e.g. pron...
G10L 15/26   Speech to text systems G10L...
G10L 17/00   Speaker identification or v...
G10L 2015/025   Phonemes, fenemes or fenone...
