System and method for identifying objects in video

US 8,064,641 B2
Filed: 12/03/2007
Issued: 11/22/2011
Est. Priority Date: 11/07/2007
Status: Active Grant

Abstract

A method for processing digital media is described. In one example embodiment, the method may include detecting an unknown object in a video frame, receiving inputs representing probable identities of the unknown object in the video frame from various sources, and associating each input with the unknown object detected in the video frame. The received inputs may be processed and compared with reference data, and, based on the comparison, probable identities of the object associated with the input may be derived. The method may further include retrieving a likelihood of the input to match the unknown object from historical data and producing weights corresponding to the inputs, fusing the inputs and the relative weight associated with each input, and identifying the unknown object based on a comparison of the weighted distances from the unknown identity to a reference identity. The relative weights are chosen from the historical data to maximize correct recognition rate based on the history of recognitions and manual verification results.

25 Claims

1. A method for identifying objects in a video, the method comprising:
- detecting a first input probable to identify an object in one or more video frames in a video stream of the video, the first input being an image of the object;
  
  determining one or more second inputs probable to identify the object in the video frames, wherein the second inputs comprise additional data extracted from at least one of the video stream and an accompanying audio stream of the video;
  
  associating the second inputs with the object;
  
  obtaining distance values between each input and a plurality of reference objects, wherein a distance value indicates a closeness of an input to an identity of a reference object;
  
  responsive to obtaining distance values for an input, associating a relative weight with the input based on the likelihood of the input to identify the object as a reference object;
  
  calculating joint distance values between the object and the reference objects, wherein a joint distance value is a weighted transformation of distance values between a plurality of inputs and a reference object;
  
  comparing the joint distance values calculated for the object; and
  
  identifying the object as a reference object based on the comparing.
- - 2. The method of claim 1, wherein a joint distance value comprises a Euclidean or a non-Euclidean distance between vectors representing a reference object and the object in an input modality.
  - 3. The method of claim 1, wherein identifying the object as a reference object based on the comparing comprises determining a reference object having a calculated joint distance value less than or equal to a threshold identification value.
  - 4. The method of claim 1, further comprising creating index data for the object, the index data including one or more of the following for the object:
    - an identifier associated with the object, a time at which the object appears in the video stream, a spatial position of the object in one or more video frames of the video stream, and other metadata associated with the object.
  - 5. The method of claim 4, wherein the object is a person and the index data, in response to a search for a name of a person in a video, returns video beginning from a time in which the person appeared on screen.
  - 6. The method of claim 1, wherein the object is a person, and the image of the object is a facial image of the person.
  - 7. The method of claim 2, wherein the input modality is associated with a facial image of a person detected in the video frame and selected from a group consisting of an Electronic Program Guide (EPG), a television channel identifier, sub-titles accompanying the video stream, optical character recognition data (OCR), human voices detected in an accompanying audio, textual data derived from the accompanying audio, a transcript of the video stream, and the facial image.
  - 8. The method of claim 6, wherein the relative weight associated with the facial image is derived from a likelihood of the person to appear in a given category of the video stream, the likelihood based on historical data representing identifications of the person in the category of video streams.
  - 9. The method of claim 1, wherein determining the second inputs comprises extracting text from one of:
    - analysis of the video frame, text conversion of the accompanying audio, an EPG and subtitles.
  - 10. The method of claim 1, wherein identifying the object is further based on categories of the video stream, the categories selected from a group consisting of same television (TV) set, same TV channel, same news channel, same web broadcast, same category of entertainment, social life, sports, politics, fashion, show business, finance, stock market;
    - the categories of the video stream being derived from a group consisting of EPG, a video transcript, and previously learned information.
  - 11. The method of claim 10, further comprising receiving statistics from identified objects and manually verifying and correcting incorrect indexes.
  - 12. The method of claim 6, wherein determining the second inputs comprises:
    - extracting a text block from a frame of the video stream;
      
      performing OCR on the text block to extract a text from the text block;
      
      identifying probable names of people or objects from the text;
      
      associating every object in video frames with probable names;
      
      comparing distances between the probable names and reference names, each distance being an edit distance between two names; and
      
      suggesting a name of the object based on the comparing.
  - 13. The method of claim 9, further comprising:
    - analysing data provided by an EPG identifying the video stream to obtain probable names of people;
      
      detecting a second object in the video frames;
      
      associating the first and second object in the video frames with probable names and corresponding distances to names associated with reference objects;
      
      comparing distances between the probable names and reference names, each distance being an edit distance between two names; and
      
      suggesting a name of an object based on the comparing.
  - 14. The method of claim 1, wherein calculating a joint distance value between the object and a reference object includes the following:
    - $D_w = \sqrt{\frac{1}{b_w C_w} + \frac{a_w}{b_w}}$, $C_w$ being computed as $C_w = w_1 C_1 + w_2 C_2 + \dots + w_n C_n$ for $n$ inputs associated with the unknown object and having a distance value to the reference object, wherein a weight $w_i$ and a confidence $C_i$ are associated with each $i$ input and the confidence $C_i$ of each $i$ input is computed from an input distance value $D_i$ as $C_i = 1/(a_i + b_i D_i^2)$, wherein $a_i$ and $b_i$ are constants chosen based on historical data and where weights $w_i$ are assigned to the inputs based on historical data to maximize a correct recognition rate, and the input distance value $D_i$ comprises one of a distance between vectors for the unknown object and a reference object, a distance from the unknown object's inputs probable identity to the reference object, and an edit distance.
  - 15. The method of claim 14, wherein the image of the object is a facial image of a person and Dw defines a joint distance to a facial image of a reference object in a reference database, the facial image being assumed to correctly identify the object as the reference object when Dw is less than a threshold associated with the facial image of the reference object.
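
The weighted fusion recited in claims 1, 3 and 14 can be illustrated with a short Python sketch. It takes the joint-distance expression as D_w = sqrt(1/(b_w*C_w) + a_w/b_w) with C_i = 1/(a_i + b_i*D_i^2). The class and function names (`InputDistance`, `joint_distance`, `identify`) and every numeric constant are hypothetical; the claims state only that the constants and weights are chosen from historical data, so this is a sketch under those assumptions rather than the patented implementation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InputDistance:
    """One input (modality) and its distance to a single reference object."""
    distance: float  # D_i
    weight: float    # w_i (hypothetical value)
    a: float         # a_i (hypothetical constant)
    b: float         # b_i (hypothetical constant)


def confidence(inp: InputDistance) -> float:
    # C_i = 1 / (a_i + b_i * D_i^2)
    return 1.0 / (inp.a + inp.b * inp.distance ** 2)


def joint_confidence(inputs: List[InputDistance]) -> float:
    # C_w = w_1*C_1 + w_2*C_2 + ... + w_n*C_n
    return sum(inp.weight * confidence(inp) for inp in inputs)


def joint_distance(inputs: List[InputDistance], a_w: float, b_w: float) -> float:
    # D_w = sqrt(1/(b_w*C_w) + a_w/b_w), following the expression in claim 14
    c_w = joint_confidence(inputs)
    return math.sqrt(1.0 / (b_w * c_w) + a_w / b_w)


def identify(candidates: Dict[str, List[InputDistance]],
             a_w: float, b_w: float, threshold: float) -> Optional[str]:
    """Return the reference with the smallest joint distance,
    or None when even the best one exceeds the threshold (claim 3)."""
    scored = {name: joint_distance(inputs, a_w, b_w)
              for name, inputs in candidates.items()}
    best = min(scored, key=scored.get)
    return best if scored[best] <= threshold else None
```

Because each confidence falls as its input distance grows, a reference object that is close in several modalities accumulates a larger C_w and therefore a smaller D_w, which is what lets the final comparison prefer it.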

16. A system for identifying objects in a video, the system comprising:
- a buffered frame sequence processor to process a plurality of video frames in a video stream of the video;
  
  a facial context extraction processor to detect and extract a first input probable to identify an object in one or more video frames in the plurality of video frames, the first input being an image of the object;
  
  extraction processors to detect and extract one or more second inputs probable to identify the object in the video frames, wherein the second inputs comprise additional data extracted from at least one of the video stream and an accompanying audio stream of the video;
  
  an associating module to associate the second inputs with the object detected in the video frames and to associate a relative weight with the input based on the likelihood of the input to identify the object as a reference object;
  
  a computing module to obtain values of a distance function from the first and second inputs to reference objects, wherein a distance function value indicates a closeness of an input to an identity of a reference object, and to obtain values of a joint distance function from the object to the reference objects, wherein a joint distance function value is a weighted transformation of distance values between a plurality of inputs and a reference object;
  
  a comparing module to compare the values of the joint distance function for the object; and
  
  an identification module to identify the object as a reference object based on the comparing.
- - 17. The system of claim 16, wherein a joint distance value comprises a Euclidean or a non-Euclidean distance between vectors representing a reference object and the object in an input modality.
  - 18. The system of claim 16, wherein the identification module to identify the object as a reference object based on the comparing comprises determining a reference object having a calculated joint distance value less than or equal to a threshold identification value.
  - 19. The system of claim 16, further comprising an indexing module to create indexed data for the object, the indexed data including one or more of the following for the object:
    - an identifier associated with the object, a time at which the object appears in the video stream, a spatial position of the object in one or more video frames of the video stream, and other metadata associated with the object.
  - 20. The system of claim 19, wherein the object is a person and the indexed data, in response to a search for a name of a person in a video, returns video beginning from a time in which the person appeared on screen.
  - 21. The system of claim 16, wherein the identification module bases the identifying of the object as the reference object in part on categories of the video stream, the categories comprising same television (TV) set, same TV channel, same news channel, same web broadcast, same category of entertainment, social life, sports, politics, fashion, show business, finance, and stock market, the categories of the video stream derived from Electronic Program Guide (EPG), a video transcript or learned previously.
  - 22. The system of claim 21, wherein the categories of the video stream are determined by gathering statistics from identified objects and from manual verification and correction of incorrect indexes.
  - 23. The system of claim 16, wherein detecting and extracting the second inputs comprises:
    - an optical character recognition (OCR) content extraction processor to extract a text block from a frame of the video stream;
      
      a text-based context creation processor to perform OCR on the text block to extract text from the text block; and
      
      a names extraction processor to identify probable names of people in the text, associate every object in the video frames with probable names, compare distances between the probable names and reference names, each distance being an edit distance between two names, and to suggest a name of the object based on the comparing.
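
The name-matching step in claims 12, 13 and 23 compares OCR-derived probable names with reference names by edit distance. A minimal Python sketch follows, assuming the Levenshtein variant of edit distance (the claims do not say which variant is used) and case-insensitive comparison. The function names and the sample names in the usage note are hypothetical.

```python
from typing import Iterable


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest_name(probable_name: str, reference_names: Iterable[str]) -> str:
    """Return the reference name closest to the probable name."""
    return min(reference_names,
               key=lambda ref: edit_distance(probable_name.lower(), ref.lower()))
```

For instance, `suggest_name("Jon Smiht", ["John Smith", "Jane Smythe"])` selects `"John Smith"`, since an OCR slip of this kind costs only a few edits against the correct reference name.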

24. A method for identifying objects in a video, the method comprising:
- means of detecting a first input probable to identify an object in one or more video frames in a video stream of the video, the first input being an image of the object;
  
  means of determining one or more second inputs probable to identify the object in the video frames, wherein the second inputs comprise additional data extracted from at least one of the video stream and an accompanying audio stream of the video;
  
  means of associating the second inputs with the object;
  
  means of obtaining distance values between each input and a plurality of reference objects, wherein a distance value indicates a closeness of an input to an identity of a reference object;
  
  means of associating a relative weight with an input based on the likelihood of the input to identify the object as a reference object responsive to obtaining distance value of the input;
  
  means of calculating joint distance values between the object and the reference objects, wherein a joint distance value is a weighted transformation of distance values between a plurality of inputs and a reference object;
  
  means of comparing the joint distance values calculated for the object; and
  
  means of identifying the object as a reference object based on the comparing.

25. A non-transitory machine-readable medium comprising instructions, which when implemented by one or more processors perform the following operations:
- detect a first input probable to identify an object in one or more video frames in a video stream of the video, the first input being an image of the object;
  
  determine one or more second inputs probable to identify the object in the video frames, wherein the second inputs comprise additional data extracted from at least one of the video stream and an accompanying audio stream of the video;
  
  associate the second inputs with the object;
  
  obtain distance values between each input and a plurality of reference objects, wherein a distance value indicates a closeness of an input to an identity of a reference object;
  
  responsive to obtaining distance values for an input, associate a relative weight with the input based on the likelihood of the input to identify the object as a reference object;
  
  calculate joint distance values between the object and the reference objects, wherein a joint distance value is a weighted transformation of distance values between a plurality of inputs and a reference object;
  
  compare the joint distance values calculated for the object; and
  
  identify the object as a reference object based on the comparing.

Current Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Original Assignee: Viewdle, Inc.
Inventors: Milshteyn, Kostyantyn; Musatenko, Yuriy; Anchyshkin, Yegor; Matsello, Vyacheslav; Schlesinger, Mykhailo; Kovtun, Ivan; Kyyko, Volodymyr
Primary Examiner(s): Akhavannik; Hadi
Application Number: US11/949,128
Publication Number: US 20090116695A1
Time in Patent Office: 1,450 Days
Field of Search: 382/100, 382/232, 382/181, 382/103
US Class Current: 382/103
CPC Class Codes:
G06F 16/58   Retrieval characterised by ...
G06V 20/40   in video content extracting...
G06V 20/635   Overlay text, e.g. embedded...
G06V 40/16   Human faces, e.g. facial pa...
