Captioning Using Socially Derived Acoustic Profiles

US 20140088961A1
Filed: 09/26/2012
Published: 03/27/2014
Est. Priority Date: 09/26/2012
Status: Active Grant

Abstract

Mechanisms are provided for performing dynamic automatic speech recognition on a portion of multimedia content. The multimedia content is segmented into at least one segment, each segment being homogeneous with regard to speakers and background sounds. For the at least one segment, a speaker providing speech in the audio track of the segment is identified using information retrieved from a social network service source. A speech profile for the speaker is generated using information retrieved from the social network service source. An acoustic profile for the segment is generated based on that speech profile, and an automatic speech recognition engine is dynamically configured for operation on the segment based on the acoustic profile. Automatic speech recognition operations are then performed on the audio track of the segment to generate a textual representation of the speech content corresponding to the speaker.
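
The flow described in the abstract can be sketched as a small per-segment loop. This is only an illustrative outline, not the patented implementation: the `Segment`, `SpeechProfile`, and `AcousticProfile` types, the `caption` function, and the three callables it takes (`identify`, `build_profile`, `recognize`) are all hypothetical names, and segmentation is assumed to have already happened.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Segment:
    """One homogeneous region of content (hypothetical structure)."""
    start: float        # seconds
    end: float          # seconds
    speaker_hint: str   # stand-in for face, voice, or metadata evidence
    background: str     # label of the dominant background sound


@dataclass
class SpeechProfile:
    """Speaker traits assumed to be derived from a social network source."""
    speaker_id: str
    accent: Optional[str] = None
    vocabulary: Dict[str, float] = field(default_factory=dict)


@dataclass
class AcousticProfile:
    """Speech profile combined with the segment's background sound."""
    speech: SpeechProfile
    background: str


def caption(
    segments: List[Segment],
    identify: Callable[[Segment], str],
    build_profile: Callable[[str], SpeechProfile],
    recognize: Callable[[Segment, AcousticProfile], str],
) -> List[Tuple[float, float, str, str]]:
    """Run identify -> profile -> configure -> recognize for each segment."""
    captions = []
    for seg in segments:
        speaker_id = identify(seg)                 # who is speaking
        speech = build_profile(speaker_id)         # how they tend to speak
        acoustic = AcousticProfile(speech, seg.background)
        text = recognize(seg, acoustic)            # engine configured per segment
        captions.append((seg.start, seg.end, speaker_id, text))
    return captions
```

Passing the three stages in as callables keeps the sketch independent of any particular social network API or recognition engine; a real system would supply its own implementations.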

25 Claims

1. A method, in a data processing system, for performing dynamic automatic speech recognition on a portion of multimedia content, comprising:
- segmenting the multimedia content into at least one segment, wherein each segment is a homogeneous region of content with regard to speakers and background sounds in the region of content;
  
  identifying, for the at least one segment, a speaker providing speech in an audio track of the at least one segment, using information retrieved from a social network service source;
  
  generating a speech profile for the speaker using information retrieved from the social network service source;
  
  generating an acoustic profile for the segment based on the generated speech profile;
  
  dynamically configuring an automatic speech recognition engine of the data processing system for operation on the at least one segment based on the acoustic profile; and
  
  performing automatic speech recognition operations on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.
- - 2. The method of claim 1, wherein performing automatic speech recognition operations on the audio track comprises generating captioning for at least one segment of the multimedia content.
  - 3. The method of claim 1, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises at least one of performing facial recognition on a corresponding portion of video in the multimedia content, audio pattern matching on the audio track of the at least one segment, or metadata analysis on metadata associated with the at least one segment.
  - 4. The method of claim 1, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - performing facial recognition on the corresponding portion of video to generate facial data;
      
      performing a search of the social network service source for a user profile having a matching facial image; and
      
      identifying the speaker based on a match between the facial data and the matching facial image.
  - 5. The method of claim 1, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - generating an audio pattern for the speaker from audio data in the audio track of the at least one segment;
      
      comparing the audio pattern to stored audio patterns for user accounts in the social network service source; and
      
      identifying the speaker based on a match between the audio pattern and a stored audio pattern.
  - 6. The method of claim 1, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - retrieving metadata associated with the at least one segment;
      
      analyzing the metadata to identify indicators of one or more speakers in the at least one segment; and
      
      comparing the indicator of the one or more speakers in the at least one segment to user identifiers in user accounts of the social network service source to identify a user identifier matching the indicator.
  - 7. The method of claim 1, wherein generating a speech profile for the speaker using information retrieved from the social network service source comprises:
    - analyzing at least one of user profile information, video/audio postings, or text postings associated with a matching user account in the social network service source to identify characteristics of the speaker's speech patterns; and
      
      generating a speech profile based on the identified characteristics of the speaker's speech patterns.
  - 8. The method of claim 7, wherein analyzing the user profile information comprises determining at least one of an accent, a cadence, or a pattern of speaking based on at least one of home location information or birthplace location information stored in the user profile information.
  - 9. The method of claim 7, wherein analyzing the video/audio postings associated with the matching user account comprises determining at least one of an accent, a cadence, or a pattern of speaking from audio pattern analysis of the video/audio postings.
  - 10. The method of claim 7, wherein analyzing at least one of the video/audio postings or the text postings associated with the matching user account comprises determining a dictionary of words and corresponding weightings that are commonly used in the video/audio postings or text postings.
  - 11. The method of claim 1, wherein generating an acoustic profile for the segment based on the generated speech profile further comprises:
    - identifying one or more background sounds in the at least one segment;
      
      retrieving a background audio pattern matching the identified one or more background sounds; and
      
      generating the acoustic profile by combining the speech profile for the speaker with the background audio pattern matching the identified one or more background sounds.
  - 12. The method of claim 11, wherein dynamically configuring an automatic speech recognition engine of the data processing system for operation on the at least one segment based on the acoustic profile comprises configuring the automatic speech recognition engine to extract the one or more background sounds from the audio track of the at least one segment based on the background audio pattern in the acoustic profile before performing automatic speech recognition on the speaker's speech in the audio track based on the speech profile.
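
As a rough illustration of claims 10 and 11, the sketch below derives a weighted word dictionary from a user's text postings and then pairs a speech profile with a stored background audio pattern. Everything here is an assumption made for illustration: the function names, the use of relative word frequency as the "weighting", and the representation of profiles and background patterns as plain dictionaries and lists are not taken from the patent.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional


def weighted_vocabulary(posts: Iterable[str]) -> Dict[str, float]:
    """Map each word in the postings to its relative frequency.

    Relative frequency is one simple choice of weighting; the claims do
    not specify how the weightings are computed.
    """
    counts = Counter(
        word
        for post in posts
        for word in re.findall(r"[a-z']+", post.lower())
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def build_acoustic_profile(
    speech_profile: dict,
    background_label: str,
    background_patterns: Dict[str, List[float]],
) -> Dict[str, Optional[object]]:
    """Combine a speech profile with the matching background pattern.

    `background_patterns` is a hypothetical store keyed by sound label;
    an unknown label yields None rather than raising.
    """
    return {
        "speech": speech_profile,
        "background": background_patterns.get(background_label),
    }
```

A recognizer could use the resulting dictionary to bias its language model toward words the speaker uses often, and use the background pattern to filter noise before decoding, as claim 12 describes.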

13. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a computing device, causes the computing device to:
- segment a multimedia content into at least one segment, wherein each segment is a homogeneous region of content with regard to speakers and background sounds in the region of content;
  
  identify, for the at least one segment, a speaker providing speech in an audio track of the at least one segment, using information retrieved from a social network service source;
  
  generate a speech profile for the speaker using information retrieved from the social network service source;
  
  generate an acoustic profile for the segment based on the generated speech profile;
  
  dynamically configure an automatic speech recognition engine of the data processing system for operation on the at least one segment based on the acoustic profile; and
  
  perform automatic speech recognition operations on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.
- - 14. The computer program product of claim 13, wherein performing automatic speech recognition operations on the audio track comprises generating captioning for at least one segment of the multimedia content.
  - 15. The computer program product of claim 13, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises at least one of performing facial recognition on a corresponding portion of video in the multimedia content, audio pattern matching on the audio track of the at least one segment, or metadata analysis on metadata associated with the at least one segment.
  - 16. The computer program product of claim 13, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - performing facial recognition on the corresponding portion of video to generate facial data;
      
      performing a search of the social network service source for a user profile having a matching facial image; and
      
      identifying the speaker based on a match between the facial data and the matching facial image.
  - 17. The computer program product of claim 13, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - generating an audio pattern for the speaker from audio data in the audio track of the at least one segment;
      
      comparing the audio pattern to stored audio patterns for user accounts in the social network service source; and
      
      identifying the speaker based on a match between the audio pattern and a stored audio pattern.
  - 18. The computer program product of claim 13, wherein identifying a speaker providing speech in an audio track of the at least one segment comprises:
    - retrieving metadata associated with the at least one segment;
      
      analyzing the metadata to identify indicators of one or more speakers in the at least one segment; and
      
      comparing the indicator of the one or more speakers in the at least one segment to user identifiers in user accounts of the social network service source to identify a user identifier matching the indicator.
  - 19. The computer program product of claim 13, wherein generating a speech profile for the speaker using information retrieved from the social network service source comprises:
    - analyzing at least one of user profile information, video/audio postings, or text postings associated with a matching user account in the social network service source to identify characteristics of the speaker's speech patterns; and
      
      generating a speech profile based on the identified characteristics of the speaker's speech patterns.
  - 20. The computer program product of claim 19, wherein analyzing the user profile information comprises determining at least one of an accent, a cadence, or a pattern of speaking based on at least one of home location information or birthplace location information stored in the user profile information.
  - 21. The computer program product of claim 19, wherein analyzing the video/audio postings associated with the matching user account comprises determining at least one of an accent, a cadence, or a pattern of speaking from audio pattern analysis of the video/audio postings.
  - 22. The computer program product of claim 19, wherein analyzing at least one of the video/audio postings or the text postings associated with the matching user account comprises determining a dictionary of words and corresponding weightings that are commonly used in the video/audio postings or text postings.
  - 23. The computer program product of claim 13, wherein generating an acoustic profile for the segment based on the generated speech profile further comprises:
    - identifying one or more background sounds in the at least one segment;
      
      retrieving a background audio pattern matching the identified one or more background sounds; and
      
      generating the acoustic profile by combining the speech profile for the speaker with the background audio pattern matching the identified one or more background sounds.
  - 24. The computer program product of claim 23, wherein dynamically configuring an automatic speech recognition engine of the data processing system for operation on the at least one segment based on the acoustic profile comprises configuring the automatic speech recognition engine to extract the one or more background sounds from the audio track of the at least one segment based on the background audio pattern in the acoustic profile before performing automatic speech recognition on the speaker's speech in the audio track based on the speech profile.
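
The audio pattern matching of claim 17 (compare a pattern extracted from the segment against stored patterns for user accounts, then pick the match) might look like the following. This is a sketch under stated assumptions: audio patterns are represented as fixed-length numeric vectors, cosine similarity is swapped in as the comparison, and the `match_speaker` name and the 0.8 threshold are hypothetical choices, none of which the claims prescribe.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(
    pattern: Sequence[float],
    stored_patterns: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed cutoff, not from the patent
) -> Optional[str]:
    """Return the user id whose stored pattern is most similar to `pattern`.

    Returns None when no stored pattern reaches the threshold, so the
    caller can fall back to facial recognition or metadata analysis.
    """
    best_id: Optional[str] = None
    best_score = threshold
    for user_id, reference in stored_patterns.items():
        score = cosine(pattern, reference)
        if score >= best_score:
            best_id, best_score = user_id, score
    return best_id
```

Returning `None` instead of the nearest candidate avoids attributing speech to the wrong account when the evidence is weak, which matters because the chosen identity drives the speech profile used for recognition.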

25. An apparatus, comprising:
- a processor; and
  
  a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
  
  segment a multimedia content into at least one segment, wherein each segment is a homogeneous region of content with regard to speakers and background sounds in the region of content;
  
  identify, for the at least one segment, a speaker providing speech in an audio track of the at least one segment, using information retrieved from a social network service source;
  
  generate a speech profile for the speaker using information retrieved from the social network service source;
  
  generate an acoustic profile for the segment based on the generated speech profile;
  
  dynamically configure an automatic speech recognition engine of the data processing system for operation on the at least one segment based on the acoustic profile; and
  
  perform automatic speech recognition operations on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Woodward, Elizabeth V.; Yan, Shunguo
Granted Patent: US 8,983,836 B2
US Class Current: 704/235
CPC Class Codes:
G10L 15/22   Procedures used during a sp...
G10L 15/26   Speech to text systems G10L...
G10L 17/00   Speaker identification or v...
G10L 2015/227   of the speaker; Human-fact...
