Audio-based annotation of video
Abstract
A technique for determining annotation items associated with video information is described. During this annotation technique, a content item that includes audio information and the video information is received. For example, a file may be downloaded from a uniform resource locator. Then, the audio information is extracted from the content item, and the audio information is analyzed to determine features or descriptors that characterize the audio information. Note that the features may be determined solely by analyzing the audio information or may be determined by subsequent further analysis of at least some of the video information based on the analysis of the audio information (i.e., sequential or cascaded analysis). Next, annotation items or tags associated with the video information are determined based on the features.
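The cascaded (audio-first) analysis described above can be sketched as follows. This is a minimal illustrative sketch under assumed names and thresholds, not the patented implementation: audio is analyzed first, and video frames are classified only at the temporal locations flagged by the audio.

```python
# Hypothetical sketch of the sequential/cascaded annotation flow: all
# function names and the energy threshold are illustrative assumptions.

def detect_acoustic_patterns(audio, frame_rate, threshold=0.5):
    """Return timestamps (seconds) whose audio-sample energy exceeds a threshold."""
    return [i / frame_rate for i, sample in enumerate(audio)
            if abs(sample) > threshold]

def annotate(audio, video_frames, frame_rate, classify_frame):
    """Analyze the audio first; classify video frames only at the
    temporal locations the audio analysis flags (cascaded analysis)."""
    annotations = {}
    for t in detect_acoustic_patterns(audio, frame_rate):
        # Audio and video are time synchronized, so an audio timestamp
        # maps directly onto a video frame index.
        frame_index = int(t * frame_rate)
        if frame_index < len(video_frames):
            annotations[t] = classify_frame(video_frames[frame_index])
    return annotations

# Toy usage: the spike at sample 3 triggers classification of frame 3 only.
audio = [0.1, 0.2, 0.1, 0.9, 0.1]
video = ["f0", "f1", "f2", "f3", "f4"]
result = annotate(audio, video, frame_rate=1, classify_frame=lambda f: "applause")
# result == {3.0: "applause"}
```

Only one of the five frames is ever passed to the (stand-in) video classifier, which is the efficiency point of analyzing the audio before the video.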
21 Claims
1. A computer-implemented method, comprising:
obtaining a plurality of user-provided content items that include respective audio information and video information associated with a type of annotation item;

training a set of classifiers using the plurality of user-provided content items to determine a relationship between features of the plurality of user-provided content items and the type of annotation item, the set of classifiers configured to identify a customized list of a set of annotation items;

receiving a content item that includes audio information and video information, the audio information being time synchronized with the video information;

extracting the audio information from the content item;

determining an acoustic pattern that characterizes a portion of the audio information by performing audio analysis on the audio information, wherein the acoustic pattern is associated with a temporal location in the audio information;

determining a corresponding temporal location in the video information based at least in part on the temporal location of the acoustic pattern;

analyzing a portion of the video information at the corresponding temporal location using the set of classifiers to determine a set of matching vectors, individual matching vectors of the set of matching vectors including weighted annotation items from the customized list of the set of annotation items;

determining a weighted summation of the weighted annotation items of the set of matching vectors to generate a merged matching vector; and

selecting an annotation item from the customized list of the set of annotation items from the merged matching vector associated with a highest weight, wherein the annotation item characterizes the video information at the corresponding temporal location.

View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
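The matching-vector merge recited in claim 1 can be sketched as follows. This is an illustrative sketch with assumed names and toy weights, not the claimed implementation: each classifier emits a vector of weighted annotation items, the vectors are combined by weighted summation into a merged matching vector, and the annotation item with the highest merged weight is selected.

```python
# Hypothetical sketch of the merge-and-select step; vectors are modeled as
# dicts mapping annotation items to weights.

def merge_matching_vectors(matching_vectors):
    """Weighted summation of per-classifier vectors into one merged vector."""
    merged = {}
    for vector in matching_vectors:
        for item, weight in vector.items():
            merged[item] = merged.get(item, 0.0) + weight
    return merged

def select_annotation(matching_vectors):
    """Select the annotation item with the highest weight in the merged vector."""
    merged = merge_matching_vectors(matching_vectors)
    return max(merged, key=merged.get)

# Two classifiers score items from a customized annotation list; "music"
# wins with a merged weight of 0.7 + 0.4 = 1.1.
vectors = [{"music": 0.7, "speech": 0.2}, {"music": 0.4, "crowd": 0.5}]
best = select_annotation(vectors)
# best == "music"
```

The selected item then serves as the annotation characterizing the video at the temporal location where the acoustic pattern was found.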
17. A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, enable a computing device to:
obtain a plurality of user-provided content items that include respective audio information and video information associated with a type of annotation item;

train a set of classifiers using the plurality of user-provided content items to determine a relationship between features of the plurality of user-provided content items and the type of annotation item, the set of classifiers configured to identify a customized list of a set of annotation items;

receive a content item that includes audio information and video information, the audio information being time synchronized with the video information;

extract the audio information from the content item;

determine an acoustic pattern that characterizes a portion of the audio information by performing audio analysis on the audio information, wherein the acoustic pattern is associated with a temporal location in the audio information;

determine a corresponding temporal location in the video information based at least in part on the temporal location of the acoustic pattern;

analyze a portion of the video information at the corresponding temporal location using the set of classifiers to determine a set of matching vectors, individual matching vectors of the set of matching vectors including weighted annotation items from the customized list of the set of annotation items;

determine a weighted summation of the weighted annotation items of the set of matching vectors to generate a merged matching vector; and

select an annotation item from the customized list of the set of annotation items from the merged matching vector associated with a highest weight, wherein the annotation item characterizes the video information at the corresponding temporal location.

View Dependent Claims (18, 19, 20)
21. A computer system, comprising:
a processor; and

memory including instructions that, when executed by the processor, cause the computer system to:

obtain a plurality of user-provided content items that include respective audio information and video information associated with a type of annotation item;

train a set of classifiers using the plurality of user-provided content items to determine a relationship between features of the plurality of user-provided content items and the type of annotation item, the set of classifiers configured to identify a customized list of a set of annotation items;

receive a content item that includes audio information and video information, the audio information being time synchronized with the video information;

extract the audio information from the content item;

determine an acoustic pattern that characterizes a portion of the audio information by performing audio analysis on the audio information, wherein the acoustic pattern is associated with a temporal location in the audio information;

determine a corresponding temporal location in the video information based at least in part on the temporal location of the acoustic pattern;

analyze a portion of the video information at the corresponding temporal location using the set of classifiers to determine a set of matching vectors, individual matching vectors of the set of matching vectors including weighted annotation items from the customized list of the set of annotation items;

determine a weighted summation of the weighted annotation items of the set of matching vectors to generate a merged matching vector; and

select an annotation item from the customized list of the set of annotation items from the merged matching vector associated with a highest weight, wherein the annotation item characterizes the video information at the corresponding temporal location.
Specification