Systems and methods for providing acoustic classification

US 20040030550A1
Filed: 07/02/2003
Published: 02/12/2004
Est. Priority Date: 07/03/2002
Status: Active Grant



Abstract

A speech recognition system receives an audio signal and detects various features of the audio signal. For example, the system classifies the audio signal into speech and non-speech portions, genders of speakers corresponding to the speech portions, and channel bandwidths used by the speakers. The system detects speaker turns based on changes in the speakers and assigns labels to the speaker turns. The system verifies the genders of the speakers and the channel bandwidths used by the speakers and identifies one or more languages associated with the audio signal. The system recognizes the speech portions of the audio signal based on the various features of the audio signal.
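By way of illustration only, the speaker-turn step in the abstract can be sketched as grouping consecutive speech segments that share a speaker. The `Segment` fields, the `speaker_turns` name, and the choice to skip non-speech segments without breaking a turn are assumptions made for this sketch; the record does not say how the patented system represents segments or detects speaker changes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical per-segment classification result (not from the patent)."""
    start: float                 # seconds
    end: float                   # seconds
    is_speech: bool
    speaker: Optional[str] = None
    gender: Optional[str] = None
    bandwidth: Optional[str] = None


def speaker_turns(segments: List[Segment]) -> List[dict]:
    """Group consecutive speech segments from the same speaker into turns.

    Assumption: non-speech segments are skipped and do not end a turn;
    a new turn starts only when the speaker changes.
    """
    turns: List[dict] = []
    for seg in segments:
        if not seg.is_speech:
            continue
        if turns and turns[-1]["speaker"] == seg.speaker:
            turns[-1]["end"] = seg.end
        else:
            turns.append({"speaker": seg.speaker, "start": seg.start, "end": seg.end})
    return turns


example = [
    Segment(0.0, 2.0, True, "A", "female", "wideband"),
    Segment(2.0, 2.5, False),
    Segment(2.5, 4.0, True, "A", "female", "wideband"),
    Segment(4.0, 6.0, True, "B", "male", "narrowband"),
]
turns = speaker_turns(example)
```

Here the two segments from speaker "A" merge into one turn across the short non-speech gap, and speaker "B" starts a second turn.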

37 Claims

1. A speech recognition system, comprising:
- audio classification logic configured to:
  
  receive an audio stream, and detect a plurality of audio features from the audio stream by:
  
  decoding the audio stream to generate a phone-class sequence relating to the audio stream, and processing the phone-class sequence to cluster speakers together, identify known ones of the speakers, and identify language used in the audio stream; and
  
  speech recognition logic configured to recognize speech in the audio stream using the audio features.
- - 2. The system of claim 1, wherein when decoding the audio stream, the audio classification logic is configured to:
    - classify the audio stream into speech and non-speech portions, identify genders of speakers corresponding to the speech portions, and identify channel bandwidths used by the speakers.
  - 3. The system of claim 2, wherein when classifying the audio stream, the audio classification logic is configured to:
    - further classify the non-speech portions into at least two of silence, music, laughter, applause, and lip smack.
  - 4. The system of claim 2, wherein when decoding the audio stream, the audio classification logic is further configured to:
    - identify speaker turns based on changes in the speakers.
  - 5. The system of claim 2, wherein when processing the phone-class sequence, the audio classification logic is further configured to:
    - refine the identification of channel bandwidths by selecting a majority one of the channel bandwidths for the speakers.
  - 6. The system of claim 2, wherein when processing the phone-class sequence, the audio classification logic is further configured to:
    - refine the identification of genders by selecting a majority one of the genders for the speakers.
  - 7. The system of claim 1, wherein when clustering speakers together, the audio classification logic is configured to:
    - assign identifiers to the speakers.
  - 8. The system of claim 7, wherein when assigning identifiers, the audio classification logic is configured to:
    - identify speech corresponding to one of the speakers who has spoken before, and attach a label to the speech to identify the speech as coming from the one of the speakers.

9. The system of claim 1, wherein when identifying known ones of the speakers, the audio classification logic is configured to:
- identify names of the known ones of the speakers.
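As a hedged illustration of the "majority one" refinement recited in claims 5 and 6, the sketch below replaces each speaker's per-segment gender or channel-bandwidth label with the label that occurs most often for that speaker. The dictionary layout and the function names `majority_label` and `refine_by_speaker` are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption, since the claims do not address ties.

```python
from collections import Counter
from typing import Dict, List


def majority_label(labels: List[str]) -> str:
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]


def refine_by_speaker(segments: List[Dict[str, str]], key: str) -> List[Dict[str, str]]:
    """Overwrite `key` (e.g. 'gender' or 'bandwidth') in every segment with
    the majority value observed for that segment's speaker."""
    observed: Dict[str, List[str]] = {}
    for seg in segments:
        observed.setdefault(seg["speaker"], []).append(seg[key])
    winners = {spk: majority_label(vals) for spk, vals in observed.items()}
    # Return new dicts so the caller's input is left unchanged.
    return [{**seg, key: winners[seg["speaker"]]} for seg in segments]


segments = [
    {"speaker": "spk1", "gender": "female", "bandwidth": "wideband"},
    {"speaker": "spk1", "gender": "male", "bandwidth": "wideband"},
    {"speaker": "spk1", "gender": "female", "bandwidth": "narrowband"},
    {"speaker": "spk2", "gender": "male", "bandwidth": "narrowband"},
]
refined = refine_by_speaker(refine_by_speaker(segments, "gender"), "bandwidth")
```

After refinement every `spk1` segment carries `female` and `wideband`, the majority values for that speaker, while `spk2` is unchanged.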

10. A method for providing speech recognition, comprising:
- receiving an audio signal;
  
  detecting a plurality of audio features from the audio signal by:
  
  generating a phone-class sequence relating to the audio signal, and processing the phone-class sequence to cluster speakers together, identify known ones of the speakers, and identify language associated with the audio signal; and
  
  recognizing speech in the audio signal using the audio features.
- - 11. The method of claim 10, wherein the generating a phone-class sequence includes:
    - classifying the audio signal into speech and non-speech portions, identifying genders of speakers corresponding to the speech portions, and identifying channel bandwidths used by the speakers.
  - 12. The method of claim 11, wherein the classifying the audio signal includes:
    - classifying the non-speech portions into at least two of silence, music, laughter, applause, and lip smack.
  - 13. The method of claim 11, wherein the generating a phone-class sequence includes:
    - identifying speaker turns based on changes in the speakers.
  - 14. The method of claim 11, wherein the processing the phone-class sequence includes:
    - refining the identification of channel bandwidths by selecting a majority one of the channel bandwidths for the speakers.
  - 15. The method of claim 11, wherein the processing the phone-class sequence includes:
    - refining the identification of genders by selecting a majority one of the genders for the speakers.
  - 16. The method of claim 10, wherein the clustering speakers together includes:
    - assigning identifiers to the speakers.
  - 17. The method of claim 16, wherein the assigning identifiers includes:
    - identifying speech corresponding to one of the speakers who has spoken before, and attaching a label to the speech to identify the speech as coming from the one of the speakers.

18. The method of claim 10, wherein the identifying known ones of the speakers includes:
- identifying a name of the known ones of the speakers.
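The labeling step of claims 16 and 17 (assigning identifiers and re-using an identifier when a speaker has spoken before) might be sketched as follows. This is an illustrative assumption, not the patented method: it supposes each speaker turn has already been reduced to a numeric feature vector, and it compares vectors by cosine similarity against a hypothetical threshold of 0.8. The names `cosine` and `label_turns` are likewise invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_turns(turn_vectors: List[Sequence[float]], threshold: float = 0.8) -> List[str]:
    """Give each turn a speaker label, re-using the label of the most similar
    earlier speaker when the similarity reaches `threshold`."""
    known: List[Tuple[str, Sequence[float]]] = []  # (label, first vector seen)
    labels: List[str] = []
    for vec in turn_vectors:
        best_label, best_score = None, threshold
        for label, ref in known:
            score = cosine(vec, ref)
            if score >= best_score:
                best_label, best_score = label, score
        if best_label is None:
            # No earlier speaker is close enough: create a new identifier.
            best_label = f"speaker_{len(known) + 1}"
            known.append((best_label, vec))
        labels.append(best_label)
    return labels


labels = label_turns([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
```

In this toy input the third turn is close to the first, so it receives the first turn's label rather than a new one.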

19. An audio classifier, comprising:
- speech event classification logic configured to:
  
  receive an audio stream, classify the audio stream into speech and non-speech portions, identify genders of speakers corresponding to the speech portions, and identify channel bandwidths used by the speakers;
  
  speaker change detection logic configured to identify speaker turns based on changes in the speakers;
  
  speaker clustering logic configured to generate labels for the speaker turns;
  
  bandwidth refinement logic configured to refine the identification of channel bandwidths by the speech event classification logic;
  
  gender refinement logic configured to refine the identification of genders by the speech event classification logic; and
  
  language identification logic configured to identify one or more languages associated with the audio stream.
- - 20. The audio classifier of claim 19, wherein when classifying the audio stream, the speech event classification logic is further configured to:
    - classify the non-speech portions into at least two of silence, music, laughter, applause, and lip smack.
  - 21. The audio classifier of claim 19, wherein when refining the identification of channel bandwidths, the bandwidth refinement logic is configured to:
    - select a majority one of the channel bandwidths when more than one of the channel bandwidths was identified as used by one of the speakers.
  - 22. The audio classifier of claim 19, wherein the speaker clustering logic is configured to:
    - identify speech corresponding to one of the speakers who has spoken before, and attach a label to the speech to identify the speech as coming from the one of the speakers.
  - 23. The audio classifier of claim 19, further comprising:
    - speaker identification logic configured to identify a name of one or more of the speakers.
  - 24. The audio classifier of claim 23, wherein the speaker identification logic is further configured to determine genders of the speakers.
  - 25. The audio classifier of claim 24, wherein the gender refinement logic is configured to:
    - determine genders of the speakers based on a comparison of the identification of genders by the speech event classification logic and the determination of genders by the speaker identification logic.
  - 26. The audio classifier of claim 19, wherein when refining the identification of genders, the gender refinement logic is configured to:
    - select a majority one of the genders when more than one of the genders was identified as corresponding to one of the speakers.
  - 27. The audio classifier of claim 19, wherein the language identification logic is configured to:
    - generate a n-gram phoneme model, and use the n-gram phoneme model to identify a language used in the speaker turns.
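Claim 27 recites an n-gram phoneme model for language identification without further detail in this record. One common way to use such a model, shown here purely as an assumption, is to train one phoneme bigram model per language and pick the language whose model gives a speaker turn's phoneme sequence the highest log-likelihood. The class `PhonemeBigramModel`, the function `identify_language`, the add-one smoothing, and the toy training data are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


class PhonemeBigramModel:
    """Bigram (n = 2) phoneme model with add-one smoothing."""

    def __init__(self, sequences: Iterable[List[str]]):
        self.bigrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab = set()
        for seq in sequences:
            padded = ["<s>"] + list(seq)  # "<s>" marks the sequence start
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, seq: List[str], vocab_size: int) -> float:
        """Smoothed log-likelihood of a phoneme sequence under this model."""
        padded = ["<s>"] + list(seq)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + vocab_size
            total += math.log(num / den)
        return total


def identify_language(models: Dict[str, PhonemeBigramModel], phonemes: List[str]) -> str:
    """Return the language whose model scores the phoneme sequence highest."""
    vocab = set(phonemes)
    for model in models.values():
        vocab |= model.vocab
    size = len(vocab)  # shared vocabulary size keeps scores comparable
    return max(models, key=lambda lang: models[lang].log_prob(phonemes, size))


models = {
    "english": PhonemeBigramModel([["dh", "ax", "k", "ae", "t"], ["dh", "ax", "d", "ao", "g"]]),
    "spanish": PhonemeBigramModel([["e", "l", "g", "a", "t", "o"], ["e", "l", "p", "e", "r", "o"]]),
}
guess = identify_language(models, ["dh", "ax", "k", "ae", "t"])
```

A practical system would train on far more data and likely use a higher-order model; the bigram order is chosen only to keep the sketch short.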

28. An audio classifier, comprising:
- means for receiving an audio stream;
  
  means for classifying the audio stream into speech and non-speech portions, genders of speakers corresponding to the speech portions, and channel bandwidths used by the speakers;
  
  means for detecting speaker turns based on changes in the speakers;
  
  means for generating identifiers for the speaker turns;
  
  means for verifying genders of the speakers and channel bandwidths used by the speakers; and
  
  means for identifying one or more languages associated with the audio stream.

29. A method for detecting features of an audio signal, comprising:
- receiving an audio signal;
  
  classifying the audio signal into speech and non-speech portions, genders of speakers corresponding to the speech portions, and channel bandwidths used by the speakers;
  
  detecting speaker turns based on changes in the speakers;
  
  labeling the speaker turns;
  
  verifying genders of the speakers and channel bandwidths used by the speakers; and
  
  identifying one or more languages associated with the audio signal.
- - 30. The method of claim 29, wherein the classifying the audio signal includes:
    - classifying the non-speech portions into at least two of silence, music, laughter, applause, and lip smack.
  - 31. The method of claim 29, wherein the verifying channel bandwidths includes:
    - selecting a majority one of the channel bandwidths when more than one of the channel bandwidths was associated with one of the speakers.
  - 32. The method of claim 29, wherein the labeling the speaker turns includes:
    - identifying speech corresponding to one of the speakers who has spoken before, and attaching a label to the speech to identify the speech as coming from the one of the speakers.
  - 33. The method of claim 29, wherein the verifying genders includes:
    - selecting a majority one of the genders when more than one of the genders was associated with one of the speakers.
  - 34. The method of claim 29, wherein the identifying one or more languages includes:
    - generating a n-gram phoneme model, and using the n-gram phoneme model to identify a language used in the speaker turns.

35. A speech recognition system, comprising:
- audio classification logic configured to:
  
  receive an audio signal, identify speech and non-speech portions of the audio signal, determine speaker turns based on the speech and non-speech portions of the audio signal, and determine one or more languages associated with the speaker turns; and
  
  speech recognition logic configured to recognize the speech portions of the audio signal based on the speaker turns and the one or more languages.

36. A speech recognition system, comprising:
- audio classification logic configured to:
  
  receive an audio signal, identify speech and non-speech portions of the audio signal, determine speaker turns based on the speech and non-speech portions of the audio signal, assign labels to the speaker turns, and identify known speakers associated with the speaker turns; and
  
  speech recognition logic configured to recognize the speech portions of the audio signal based on the speaker turns, the labels, and the known speakers.

37. A method for recognizing speech, comprising:
- receiving an audio signal containing speech;
  
  determining genders of speakers associated with the speech;
  
  determining channel bandwidths used by the speakers;
  
  identifying speaker turns based on changes in the speakers;
  
  refining the determination of genders by selecting a majority one of the genders when more than one of the genders was associated with one of the speaker turns; and
  
  refining the determination of channel bandwidths by selecting a majority one of the channel bandwidths when more than one of the channel bandwidths was associated with one of the speaker turns.

Current Assignee: Verizon Patent and Licensing Incorporated (Verizon Communications Inc.)
Original Assignee: BBN Technologies Corporation (RTX Corporation)
Inventors: Kubala, Francis G.; Liu, Daben

Granted Patent: US 7,337,115 B2

US Class Current: 704/231
CPC Class Codes

G10L 15/26   Speech to text systems G10L...

G10L 25/78   Detection of presence or ab...

H04M 2201/42   Graphical user interfaces

H04M 2201/60   Medium conversion

H04M 2203/305   Recording playback features...

Y10S 707/99943   Generating database or data...
