Audio locale mismatch detection

US 10,860,648 B1
Filed: 09/12/2018
Issued: 12/08/2020
Est. Priority Date: 09/12/2018
Status: Active Grant

Abstract

Systems, methods, and computer-readable media are disclosed for detecting a mismatch between the spoken language in an audio file and the audio language that is tagged as the spoken language in the audio file metadata. Example methods may include receiving a media file including spoken language metadata. Certain methods include generating an audio sample from the media file. Certain methods include generating a text translation of the audio sample based on the spoken language metadata. Certain methods include determining that the spoken language metadata does not match a spoken language in the audio sample based on the text translation. Certain methods include sending an indication that the spoken language metadata does not match the spoken language.
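The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `MediaFile` type, the `recognize` callable (assumed to return a transcript and a confidence between 0 and 1), the byte-slice sampling, and the 0.6 threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical recognizer signature: (audio_sample, language_code) -> (transcript, confidence).
Recognizer = Callable[[bytes, str], Tuple[str, float]]


@dataclass
class MediaFile:
    audio: bytes            # raw audio data (format is not modeled here)
    spoken_language: str    # language tagged in the spoken language metadata


def extract_sample(media: MediaFile, max_bytes: int = 16000) -> bytes:
    """Use a leading slice of the audio as the sample (an arbitrary illustrative choice)."""
    return media.audio[:max_bytes]


def check_locale(media: MediaFile, recognize: Recognizer,
                 threshold: float = 0.6) -> Optional[str]:
    """Return a mismatch indication, or None if the tagged language looks plausible."""
    sample = extract_sample(media)
    # Transcribe the sample as if it were spoken in the tagged language.
    _text, confidence = recognize(sample, media.spoken_language)
    if confidence < threshold:
        return (f"spoken language metadata '{media.spoken_language}' "
                f"does not appear to match the audio (confidence {confidence:.2f})")
    return None
```

In practice `recognize` would wrap a real speech recognition engine; here it is left as a parameter so the sketch stays self-contained.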

20 Claims

1. A method comprising:
- receiving, by one or more computer processors coupled to at least one memory, a media file comprising audio data and spoken language metadata, the spoken language metadata comprising an indication of an English language;
  
  extracting, by the one or more computer processors, an audio sample from the audio data of the media file;
  
  generating, by the one or more computer processors, a first text translation of the audio sample using a speech recognition engine based on the English language;
  
  determining, by the one or more computer processors, that the English language does not match a spoken language of the media file based on the first text translation of the audio sample;
  
  generating, by the one or more computer processors, a second text translation of the audio sample using the speech recognition engine based on a Spanish language;
  
  determining, by the one or more computer processors, that the Spanish language does match the spoken language of the media file based on the second text translation; and
  
  replacing, by the one or more computer processors, the indication of the English language in the spoken language metadata of the media file with a second indication of the Spanish language.
  - 2. The method of claim 1, wherein determining that the English language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, that a confidence metric associated with the first text translation does not satisfy a threshold.
  - 3. The method of claim 1, wherein determining that the English language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, an expected bigram frequency based on the English language;
      
      determining, by the one or more computer processors, an actual bigram frequency associated with the first text translation; and
      
      determining, by the one or more computer processors, that the actual bigram frequency does not match the expected bigram frequency.
  - 4. The method of claim 1, wherein determining that the English language does not match the spoken language further comprises:
    - receiving, by the one or more computer processors, a timed text asset associated with the media file; and
      
      determining, by the one or more computer processors, that the first text translation does not match the timed text asset.
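As an illustration of the bigram comparison in claim 3, the sketch below builds character-bigram frequency profiles and compares them with cosine similarity. The claim does not specify character versus word bigrams, the similarity measure, or a threshold; those, along with the function names, are assumptions made for this example. The expected profile would typically come from a reference corpus in the tagged language.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def bigram_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of each character bigram (letters and spaces only)."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()} if total else {}


def cosine_similarity(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Cosine similarity between two sparse frequency profiles (0.0 if either is empty)."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def bigrams_match(expected: Dict[str, float], actual: Dict[str, float],
                  min_similarity: float = 0.5) -> bool:
    """True when the transcript's bigram profile is close enough to the expected one."""
    return cosine_similarity(expected, actual) >= min_similarity
```

A transcript produced by forcing the wrong language model onto the audio would be expected to yield an unusual bigram profile, which is what the similarity check is meant to surface.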

5. A method comprising:
- receiving, by one or more computer processors coupled to at least one memory, a media file comprising audio data and spoken language metadata, the spoken language metadata comprising an indication of a first language;
  
  generating, by the one or more computer processors, an audio sample from the audio data of the media file;
  
  generating, by the one or more computer processors, a text translation of the audio sample based on the first language;
  
  determining, by the one or more computer processors, that the first language does not match a spoken language of the media file based on the text translation of the audio sample;
  
  determining, by the one or more computer processors, that a second language matches the spoken language of the media file; and
  
  replacing, by one or more computer processors, the indication of the first language in the spoken language metadata of the media file with a second indication of the second language.
  - 6. The method of claim 5, wherein determining that the second language matches the spoken language of the media file comprises:
    - generating, by the one or more computer processors, a second text translation of the audio sample based on the second language.
  - 7. The method of claim 5, wherein replacing the indication of the first language in the spoken language metadata with the second indication of the second language comprises:
    - generating, by the one or more computer processors, a second text translation of the audio sample based on the second language;
      
      determining, by the one or more computer processors, a first confidence metric associated with the text translation based on the first language;
      
      determining, by the one or more computer processors, a second confidence metric associated with the second text translation; and
      
      determining, by the one or more computer processors, that the second confidence metric is greater than the first confidence metric.
  - 8. The method of claim 5, wherein determining that the first language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, a confidence metric associated with the text translation; and
      
      determining, by the one or more computer processors, that the confidence metric does not satisfy a threshold.
  - 9. The method of claim 5, wherein determining that the first language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, an expected n-gram frequency based on the first language;
      
      determining, by the one or more computer processors, an actual n-gram frequency associated with the text translation; and
      
      determining, by the one or more computer processors, that the actual n-gram frequency does not match the expected n-gram frequency.
  - 10. The method of claim 9, wherein the expected n-gram frequency comprises a first bigram frequency and the actual n-gram frequency comprises a second bigram frequency.
  - 11. The method of claim 5, wherein determining that the first language does not match the spoken language further comprises:
    - receiving, by the one or more computer processors, a timed text asset associated with the media file; and
      
      determining, by the one or more computer processors, that the text translation does not match the timed text asset.
  - 12. The method of claim 5, wherein determining that the second language matches the spoken language of the media file comprises:
    - generating, by the one or more computer processors, a second audio sample from the media file; and
      
      generating, by the one or more computer processors, a third text translation of the second audio sample based on the second language.
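The replacement step of claims 5 through 8 might look something like the sketch below: transcribe with the tagged language, and if its confidence falls under a threshold, try other candidate languages and relabel when one scores higher. The `recognize` callable, the `spoken_language` metadata key, the candidate list, and the threshold value are hypothetical names and values introduced for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical recognizer signature: (audio_sample, language_code) -> (transcript, confidence).
Recognizer = Callable[[bytes, str], Tuple[str, float]]


def relabel_if_better(metadata: Dict[str, str], sample: bytes,
                      candidates: Iterable[str], recognize: Recognizer,
                      threshold: float = 0.6) -> Dict[str, str]:
    """Return metadata with the spoken language replaced when another language fits better."""
    first_language = metadata["spoken_language"]
    _text, first_confidence = recognize(sample, first_language)
    if first_confidence >= threshold:
        return metadata  # the tagged language satisfies the threshold; nothing to fix

    best_language, best_confidence = first_language, first_confidence
    for language in candidates:
        if language == first_language:
            continue
        _text, confidence = recognize(sample, language)
        # Keep the candidate whose confidence exceeds everything seen so far.
        if confidence > best_confidence:
            best_language, best_confidence = language, confidence

    if best_language == first_language:
        return metadata  # no candidate beat the tagged language
    # Copy the metadata rather than mutating the caller's dictionary.
    return {**metadata, "spoken_language": best_language}
```

This sketch only requires the second confidence to exceed the first, as in claim 7; a stricter variant could also require the winning candidate to satisfy the threshold itself.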

13. A device comprising:
- at least one memory that stores computer-executable instructions; and
  
  at least one processor configured to access the memory and execute the computer-executable instructions to:
  
  receive a media file comprising audio data and spoken language metadata, the spoken language metadata comprising an indication of a first language;
  
  generate an audio sample from the audio data of the media file;
  
  generate a text translation of the audio sample based on the first language;
  
  determine that the first language does not match a spoken language of the media file based on the text translation of the audio sample;
  
  determine that a second language matches the spoken language of the media file; and
  
  replace the indication of the first language in the spoken language metadata of the media file with a second indication of the second language.
  - 14. The device of claim 13, wherein determining that the second language matches the spoken language of the media file comprises:
    - generating, by the one or more computer processors, a second text translation of the audio sample based on the second language.
  - 15. The device of claim 13, wherein replacing the indication of the first language in the spoken language metadata with the second indication of the second language comprises:
    - generating, by the one or more computer processors, a second text translation of the audio sample based on the second language;
      
      determining, by the one or more computer processors, a first confidence metric associated with the text translation based on the first language;
      
      determining, by the one or more computer processors, a second confidence metric associated with the second text translation; and
      
      determining, by the one or more computer processors, that the second confidence metric is greater than the first confidence metric.
  - 16. The device of claim 13, wherein determining that the first language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, a confidence metric associated with the text translation; and
      
      determining, by the one or more computer processors, that the confidence metric does not satisfy a threshold.
  - 17. The device of claim 16, wherein the confidence metric is an average per word confidence, a standard deviation, or a mean squared error associated with the text translation.
  - 18. The device of claim 13, wherein determining that the first language does not match the spoken language further comprises:
    - determining, by the one or more computer processors, an expected n-gram frequency based on the first language;
      
      determining, by the one or more computer processors, an actual n-gram frequency associated with the text translation; and
      
      determining, by the one or more computer processors, that the actual n-gram frequency does not match the expected n-gram frequency.
  - 19. The device of claim 18, wherein the expected n-gram frequency comprises a first bigram frequency and the actual n-gram frequency comprises a second bigram frequency.
  - 20. The device of claim 13, wherein determining that the first language does not match the spoken language further comprises:
    - receiving, by the one or more computer processors, a timed text asset associated with the media file; and
      
      determining, by the one or more computer processors, that the text translation does not match the timed text asset.
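The timed text comparison in claims 4, 11, and 20 could be approximated by comparing the transcript's word sequence against the subtitle or caption text. The sketch below uses Python's standard `difflib.SequenceMatcher`; the claims do not say how the match is measured, so the word-level comparison, the 0.6 ratio, the function names, and the assumption that the timed text is written in the tagged language are all illustrative.

```python
import re
from difflib import SequenceMatcher
from typing import List, Sequence


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def matches_timed_text(transcript: str, timed_text_cues: Sequence[str],
                       min_ratio: float = 0.6) -> bool:
    """True when the transcript's words overlap enough with the timed text cues."""
    spoken = normalize(transcript)
    reference = normalize(" ".join(timed_text_cues))
    # ratio() is 2 * matches / total tokens, so 1.0 means identical word sequences.
    ratio = SequenceMatcher(None, spoken, reference).ratio()
    return ratio >= min_ratio
```

A low ratio would suggest that the transcript generated under the tagged language does not resemble the captions, which is one possible signal of a mislabeled audio track.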

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: McCormick, Manolya; Bhat, Vimal; Nun, Shai Ben
Primary Examiner(s): Shin, Seong-Ah A
Application Number: US16/129,567
Time in Patent Office: 818 Days
Field of Search: None

CPC Class Codes

G06F 16/685   using automatically derived...

G10L 15/005   Language recognition

G10L 15/05   Word boundary detection

G10L 15/18   using natural language mode...

G10L 15/32   Multiple recognisers used i...

G10L 2015/088   Word spotting

H04N 21/4394   involving operations for an...
