Automatic Collection of Speaker Name Pronunciations
First Claim
1. A method comprising:
segmenting an audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
transcribing the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
identifying a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
identifying at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or subsequent time segment in the SSR transcript;
filtering the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
selecting all of the low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
decoding a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
correlating the selected name label with all of the phoneme transcripts for the selected likely name regions.
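The filtering steps recited above (low-confidence identification, NER-based name spotting, and their intersection) can be sketched in Python. This is a minimal illustration, not the patented implementation: the tuple layout `(word, start, end, confidence)`, the threshold value, and the toy "capitalized token after a greeting cue" NER rule are all assumptions made for the demo.

```python
CONF_THRESHOLD = 0.6  # assumed cutoff for "low confidence" (not specified in the claim)

# Stand-in for an ASR transcript: (word, start_sec, end_sec, confidence) per word region.
asr_transcript = [
    ("thanks", 0.0, 0.4, 0.95),
    ("Krzysztof", 0.4, 1.1, 0.31),   # a rare name: the ASR engine is unsure
    ("for", 1.1, 1.3, 0.97),
    ("joining", 1.3, 1.8, 0.92),
]

def is_low_confidence(region, threshold=CONF_THRESHOLD):
    # A word region is "low confidence" when its accuracy confidence is below the threshold.
    return region[3] < threshold

def ner_likely_names(transcript):
    """Toy NER rule: a capitalized token right after a greeting cue is a likely name."""
    cues = {"thanks", "hi", "hello", "welcome"}
    likely = []
    for prev, cur in zip(transcript, transcript[1:]):
        if prev[0].lower() in cues and cur[0][:1].isupper():
            likely.append(cur)
    return likely

# Step 1: identify low confidence regions.
low_conf = [r for r in asr_transcript if is_low_confidence(r)]
# Step 2: identify likely name regions via the NER rule.
likely_names = ner_likely_names(asr_transcript)
# Step 3: filter likely names against low confidence regions.
low_conf_names = [r for r in likely_names if r in low_conf]
print(low_conf_names)  # only the "Krzysztof" region survives both filters
```

The intersection is the point of the claim: a region must look like a name *and* have been transcribed with low confidence before its pronunciation is worth collecting.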
1 Assignment
0 Petitions
Abstract
An audio stream is segmented into a plurality of time segments using speaker segmentation and recognition (SSR), with each time segment corresponding to the speaker's name, producing an SSR transcript. The audio stream is transcribed into a plurality of word regions using automatic speech recognition (ASR), with each of the word regions having a measure of the confidence in the accuracy of the transcription, producing an ASR transcript. Word regions with a relatively low confidence in the accuracy of the transcription are identified. The low confidence regions are filtered using named entity recognition (NER) rules to identify low confidence regions that are likely names. The NER rules associate a region that is identified as a likely name with the name of the speaker corresponding to the current, the previous, or the next time segment. All of the likely name regions associated with that speaker's name are selected.
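The current/previous/next association described above reflects how names occur in conversation: a speaker may say their own name, hand off to the next speaker ("over to you, Bob"), or thank the previous one. A sketch of that lookup, assuming SSR segments as `(start, end, name_label)` tuples (an illustrative layout, not the patent's):

```python
# SSR transcript stand-in: one (start_sec, end_sec, name_label) tuple per time segment.
ssr = [(0.0, 5.0, "Alice"), (5.0, 9.0, "Bob"), (9.0, 14.0, "Carol")]

def candidate_labels(ssr_segments, region_start, region_end):
    """Return the name label of the time segment containing a likely name
    region, plus the labels of its immediate neighbours, since a spoken
    name often refers to the previous or next speaker."""
    for i, (seg_start, seg_end, label) in enumerate(ssr_segments):
        if seg_start <= region_start < seg_end:
            labels = [label]                          # current segment
            if i > 0:
                labels.append(ssr_segments[i - 1][2]) # previous segment
            if i + 1 < len(ssr_segments):
                labels.append(ssr_segments[i + 1][2]) # next segment
            return labels
    return []  # region falls outside every segment

print(candidate_labels(ssr, 6.2, 6.9))  # ['Bob', 'Alice', 'Carol']
```

Which of the candidates is actually chosen would depend on the NER rule that fired; this sketch only enumerates the three segments the claims allow.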
26 Citations
20 Claims
1. A method comprising:
segmenting an audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
transcribing the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
identifying a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
identifying at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or subsequent time segment in the SSR transcript;
filtering the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
selecting all of the low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
decoding a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
correlating the selected name label with all of the phoneme transcripts for the selected likely name regions.
View Dependent Claims (2, 3, 4, 5, 6, 7)
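The final two steps of the claim, decoding a phoneme transcript for each selected region and correlating them with the selected name label, amount to grouping pronunciations under a name. A sketch under stated assumptions: `decode_phonemes` is a stub standing in for a real phoneme decoder, the region identifiers are invented, and the ARPAbet-style phone strings are illustrative.

```python
def decode_phonemes(audio_region):
    """Stand-in for a real phoneme decoder operating on the audio stream;
    here it just looks up a canned pronunciation for the demo."""
    canned = {
        "region-a": "K SH IH SH T AO F",
        "region-b": "K R IH S T AO F",
    }
    return canned[audio_region]

# (selected name label, audio region id) pairs surviving the earlier filtering steps.
selected = [("Krzysztof", "region-a"), ("Krzysztof", "region-b")]

# Correlate the selected name label with all of its phoneme transcripts.
pronunciations = {}
for label, region in selected:
    pronunciations.setdefault(label, []).append(decode_phonemes(region))

print(pronunciations)
# {'Krzysztof': ['K SH IH SH T AO F', 'K R IH S T AO F']}
```

Collecting several phoneme transcripts per label is what makes the method useful: multiple observed pronunciations of the same name can later be reconciled into a reliable one.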
8. One or more computer readable storage media encoded with software comprising computer executable instructions and when the software is executed operable to cause a processor to:
segment an audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
transcribe the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
identify a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
identify at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or subsequent time segment;
filter the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
select all of the likely low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
create a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
correlate the selected name label with all of the phoneme transcripts for the selected likely name regions.
View Dependent Claims (9, 10, 11, 12, 13, 14)
15. An apparatus comprising:
an input/output interface configured to receive an audio stream;
a processor coupled to the input/output interface and configured to:
segment the audio stream into a plurality of time segments using speaker segmentation and recognition (SSR), each of the plurality of time segments corresponding to a name label, to produce an SSR transcript;
transcribe the audio stream into a plurality of word regions using automatic speech recognition (ASR), each of the plurality of word regions having an associated accuracy confidence, to produce an ASR transcript;
identify a plurality of low confidence regions from the plurality of word regions, each of the low confidence regions having an associated accuracy confidence below a threshold;
identify at least one likely name region from the ASR transcript using named entity recognition (NER) rules, wherein the NER rules analyze word regions to identify the at least one likely name region, and the NER rules associate each of the at least one likely name regions with a name label from the SSR transcript corresponding to one of a current, previous, or next time segment;
filter the at least one likely name region with the plurality of low confidence regions to determine at least one low confidence name region;
select all of the likely low confidence name regions associated with a selected name label, the selected name label being selected from the name labels in the SSR transcript;
create a phoneme transcript from the audio stream for each of the selected likely name regions using a phoneme decoder; and
correlate the selected name label with all of the phoneme transcripts for the selected likely name regions.
View Dependent Claims (16, 17, 18, 19, 20)
Specification