IDENTIFYING CORRESPONDING REGIONS OF CONTENT

US 20150340038A1
Filed: 08/03/2015
Published: 11/26/2015
Est. Priority Date: 08/02/2012
Status: Active Grant

Abstract

A content alignment service may generate content synchronization information to facilitate the synchronous presentation of audio content and textual content. In some embodiments, a region of the textual content whose correspondence to the audio content is uncertain may be analyzed to determine whether the region of textual content corresponds to one or more words that are audibly presented in the audio content, or whether the region of textual content is a mismatch with respect to the audio content. In some embodiments, words in the textual content that correspond to words in the audio content are synchronously presented, while mismatched words in the textual content may be skipped to maintain synchronous presentation. Accordingly, in one example application, an audiobook is synchronized with an electronic book, so that as the electronic book is displayed, corresponding words of the audiobook are audibly presented.
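
The skipping behavior described in the abstract can be pictured with a small sketch. It is illustrative only: the function name `build_sync_map`, the tuple layout of `audio_words`, and the use of Python's `difflib.SequenceMatcher` as the alignment step are assumptions made here, not details taken from the patent.

```python
from difflib import SequenceMatcher


def build_sync_map(text_words, audio_words):
    """Map each text word to an (start, end) audio span, or None to skip it.

    audio_words is a list of (word, start_seconds, end_seconds) tuples
    (a hypothetical layout chosen for this sketch).
    """
    written = [w.lower() for w in text_words]
    spoken = [w.lower() for w, _, _ in audio_words]
    sync = [None] * len(text_words)
    matcher = SequenceMatcher(a=written, b=spoken, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = audio_words[block.b + k]
            sync[block.a + k] = (start, end)
    return sync


# "Chapter One" appears in the text but is not narrated, so it is skipped.
text = ["Chapter", "One", "It", "was", "a", "dark", "night"]
audio = [("it", 0.0, 0.2), ("was", 0.2, 0.4), ("a", 0.4, 0.5),
         ("dark", 0.5, 0.9), ("night", 0.9, 1.3)]
sync = build_sync_map(text, audio)
```

Entries left as `None` mark text words with no audio counterpart; a player following this map would advance past them without pausing the narration.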

20 Claims

1. A computer-implemented method comprising:
- as implemented by one or more computing devices configured with specific computer-executable instructions, identifying audio data of an audio content item that is preliminarily correlated to a region of a textual content item;
  
  determining that text within the region of the textual content item does not correspond to the audio data of the audio content item;
  
  transcribing the audio data with automated speech recognition to generate a textual transcription of the audio data;
  
  determining that the textual transcription corresponds to the text within the region of the textual content; and
  
  generating synchronization information that associates the audio data of the audio content with the text within the region of the textual content.
  - 2. The computer-implemented method of claim 1, wherein determining that the text within the region of the textual content item does not correspond to the audio data of the audio content item comprises identifying at least one word within the text not represented in the audio data.
  - 3. The computer-implemented method of claim 1, wherein transcribing the audio data with automated speech recognition to generate a textual transcription of the audio data comprises generating a language model using the text within the region of the textual content and applying the language model to the audio data.
  - 4. The computer-implemented method of claim 1, wherein determining that the textual transcription corresponds to the text within the region of the textual content comprises:
    - identifying a word represented in both the textual transcription and the text within the region of the textual content;
      
      assigning a score value to the word; and
      
      determining that the score value of the word satisfies a threshold value.
  - 5. The computer-implemented method of claim 4, wherein the score value of the word indicates a frequency of occurrence of the word within the textual content.
  - 6. The computer-implemented method of claim 1, wherein determining that the textual transcription corresponds to the text within the region of the textual content comprises:
    - generating a first phoneme string from the text within the region of the textual content;
      
      generating a second phoneme string from the textual transcription of the audio data; and
      
      determining that the first phoneme string corresponds to the second phoneme string.
  - 7. The computer-implemented method of claim 6, wherein determining that the first phoneme string corresponds to the second phoneme string comprises determining that a Levenshtein distance between the first phoneme string and the second phoneme string satisfies a threshold value.
  - 8. The computer-implemented method of claim 6, wherein determining that the first phoneme string corresponds to the second phoneme string comprises determining that an acoustically confusable hypothesis for the first phoneme string corresponds to the second phoneme string.
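
The phoneme comparison recited in claims 6 and 7 can be sketched as follows. This is a minimal illustration under stated assumptions: `TOY_LEXICON` is a hypothetical hand-written pronunciation table, `max_distance` is an invented parameter, and "satisfies a threshold value" is read here as "is at or below the threshold", which the claims do not specify.

```python
# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols). A real
# system would use a full pronouncing dictionary or a grapheme-to-phoneme model.
TOY_LEXICON = {
    "their": ["DH", "EH", "R"],
    "there": ["DH", "EH", "R"],
    "knight": ["N", "AY", "T"],
    "night": ["N", "AY", "T"],
    "rode": ["R", "OW", "D"],
    "road": ["R", "OW", "D"],
    "wrote": ["R", "OW", "T"],
}


def to_phonemes(words):
    """Concatenate the phonemes of each word into one phoneme string."""
    phonemes = []
    for word in words:
        phonemes.extend(TOY_LEXICON[word.lower()])
    return phonemes


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def phonemes_correspond(text_words, transcript_words, max_distance=1):
    """Assumed reading of the threshold test: distance <= max_distance."""
    distance = levenshtein(to_phonemes(text_words), to_phonemes(transcript_words))
    return distance <= max_distance
```

Comparing phonemes rather than spellings lets homophones match: `phonemes_correspond(["their", "knight", "rode"], ["there", "night", "road"])` is true even though no word is spelled the same.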

9. A system comprising:
- a non-transitory data store storing audio data of an audio content item and text of a textual content item; and
  
  a processor in communication with the non-transitory data store and configured with specific computer-executable instructions that, when executed by the processor, cause the processor to at least:
  
  identify a preliminary correlation between the audio data and a region of the textual content item;
  
  determine that text within the region of the textual content item does not correspond to the audio data of the audio content item;
  
  transcribe the audio data with automated speech recognition to generate a textual transcription of the audio data;
  
  determine that the textual transcription corresponds to the text within the region of the textual content item; and
  
  generate synchronization information that associates the audio data of the audio content with the text within the region of the textual content.
  - 10. The system of claim 9, wherein the specific computer-executable instructions cause the processor to at least determine that the text within the region of the textual content item does not correspond to the audio data of the audio content item at least partly by identifying at least one word within the text not represented in the audio data.
  - 11. The system of claim 9, wherein the specific computer-executable instructions cause the processor to at least transcribe the audio data according to an automated speech recognition routine at least partly by generating a language model using the text within the region of the textual content and applying the language model to the audio data.
  - 12. The system of claim 9, wherein the specific computer-executable instructions further cause the processor to at least:
    - identify a word represented in both the textual transcription and the text within the region of the textual content; and
      
      assign a score value to the word based at least in part on a commonality of the word; and
      
      wherein the specific computer-executable instructions cause the processor to determine that the textual transcription corresponds to the text within the region of the textual content at least in part by determining that the score value of the word satisfies a threshold value.
  - 13. The system of claim 12, wherein the commonality of the word is determined according to a predetermined data set assigning commonalities to a plurality of words.
  - 14. The system of claim 9, wherein the specific computer-executable instructions further cause the processor to at least:
    - generate a first phoneme string from the text within the region of the textual content; and
      
      generate a second phoneme string from the textual transcription of the audio data; and
      
      wherein the specific computer-executable instructions cause the processor to determine that the textual transcription corresponds to the text within the region of the textual content at least in part by determining that the first phoneme string corresponds to the second phoneme string.
  - 15. The system of claim 14, wherein the first phoneme string corresponds to the second phoneme string when a Levenshtein distance between the first phoneme string and the second phoneme string satisfies a threshold value.
  - 16. The system of claim 14, wherein the first phoneme string corresponds to the second phoneme string when an acoustically confusable hypothesis for the first phoneme string corresponds to an acoustically confusable hypothesis for the second phoneme string.
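
The word scoring recited in claims 12 and 13 might look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the `COMMONALITY` table and its values, the default for unlisted words, the negative-log scoring, and the convention that rarer words score higher and must meet or exceed the threshold.

```python
import math

# Hypothetical predetermined data set: the fraction of running text each word
# accounts for. A real system would load a corpus-derived frequency list.
COMMONALITY = {
    "the": 0.05, "and": 0.03, "of": 0.03, "was": 0.01,
    "lighthouse": 0.00001, "harpoon": 0.000005,
}
DEFAULT_COMMONALITY = 0.0001  # assumed value for words missing from the table


def word_score(word):
    """Score a word so that rarer words score higher (assumed convention)."""
    commonality = COMMONALITY.get(word.lower(), DEFAULT_COMMONALITY)
    return -math.log10(commonality)


def significant_shared_words(text_words, transcript_words, threshold=3.0):
    """Words present in both inputs whose score meets the threshold."""
    shared = {w.lower() for w in text_words} & {w.lower() for w in transcript_words}
    return sorted(w for w in shared if word_score(w) >= threshold)


text = "the keeper of the lighthouse was asleep".split()
transcript = "the lighthouse was a sleep".split()
matches = significant_shared_words(text, transcript)
```

Under these assumptions only "lighthouse" counts as evidence of correspondence; "the" and "was" are shared too, but they are common enough to appear in almost any passage.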

17. Non-transitory computer-readable storage media comprising computer-executable instructions that, when executed by a computing system, cause the computing system to at least:
- compare text within a region of a textual content item and audio data of an audio content item that is preliminarily correlated to the region of the textual content item;
  
  determine that the text within the region of the textual content item does not correspond to the audio data of the audio content item;
  
  apply automated speech recognition to the audio data to generate a textual transcription of the audio data;
  
  determine that the textual transcription corresponds to the text within the region of the textual content item; and
  
  generate synchronization information that associates the audio data of the audio content with the text within the region of the textual content item.
  - 18. The non-transitory computer-readable storage media of claim 17, wherein the computer-executable instructions cause the computing system to at least determine that the text within the region of the textual content item does not correspond to the audio data of the audio content item at least partly by identifying at least one word within the text not represented in the audio data.
  - 19. The non-transitory computer-readable storage media of claim 17, wherein the computer-executable instructions cause the computing system to at least apply automated speech recognition to the audio data to generate a textual transcription of the audio data at least partly by generating a language model using the text within the region of the textual content and applying the language model to the audio data.
  - 20. The non-transitory computer-readable storage media of claim 17, wherein the computer-executable instructions further cause the computing system to at least:
    - identify a word represented in both the textual transcription and the text within the region of the textual content; and
      
      assign a score value to the word; and
      
      wherein the computer-executable instructions cause the computing system to determine that the textual transcription corresponds to the text within the region of the textual content at least in part by determining that the score value of the word satisfies a threshold value.
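
The sequence of steps shared by the independent claims (compare, detect a mismatch, transcribe again, confirm, generate synchronization information) can be outlined in code. This is a sketch, not the patented implementation: `Region`, `SyncInfo`, `resolve_region`, the `transcribe` callable, and the `min_overlap` cutoff are all hypothetical names and choices, and the speech recognizer is replaced by a stand-in function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    text_words: List[str]         # text within the region of the textual content item
    audio_start: float            # preliminarily correlated audio span (seconds)
    audio_end: float
    preliminary_words: List[str]  # words a preliminary pass recognized in that span


@dataclass
class SyncInfo:
    text_words: List[str]
    audio_start: float
    audio_end: float


def missing_words(text_words, heard_words):
    """Text words that are not represented among the recognized words."""
    heard = {w.lower() for w in heard_words}
    return [w for w in text_words if w.lower() not in heard]


def resolve_region(region, transcribe, min_overlap=0.8):
    """Return SyncInfo if the region is confirmed, otherwise None."""
    if not region.text_words:
        return None
    sync = SyncInfo(region.text_words, region.audio_start, region.audio_end)
    # Steps 1-2: compare, and stop early if nothing appears mismatched.
    if not missing_words(region.text_words, region.preliminary_words):
        return sync
    # Step 3: transcribe the audio again; the region text is passed as a hint,
    # loosely mirroring the region-specific language model of claim 19.
    transcript = transcribe(region.audio_start, region.audio_end, region.text_words)
    # Steps 4-5: confirm correspondence, then emit synchronization information.
    still_missing = missing_words(region.text_words, transcript)
    overlap = 1 - len(still_missing) / len(region.text_words)
    return sync if overlap >= min_overlap else None


def fake_asr(start, end, hint_words):
    """Stand-in for a speech recognizer; always 'hears' the same words."""
    return ["call", "me", "ishmael"]


region = Region(["Call", "me", "Ishmael"], 12.0, 13.5, ["call", "me", "fishmeal"])
result = resolve_region(region, fake_asr)
```

Passing the recognizer in as a callable keeps the sketch independent of any particular speech recognition library; a region that still fails the overlap test after the second pass is left as a mismatch and returns `None`.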

Current Assignee: Audible Incorporated (Amazon.com, Inc.)
Original Assignee: Audible Incorporated (Amazon.com, Inc.)
Inventors: Story, Guy A. Jr.; Dzik, Steven C.

Granted Patent: US 9,799,336 B2
US Class Current: 1/1
CPC Class Codes

G06F 16/4393   Multimedia presentations, e...

G10L 15/183   using context dependencies,...

G10L 15/26   Speech to text systems G10L...
