Associating media with metadata of near-duplicates

US 9,703,782 B2
Filed: 05/28/2010
Issued: 07/11/2017
Est. Priority Date: 05/28/2010
Status: Active Grant

Abstract

Techniques for identifying near-duplicates of a media object and associating metadata of the near-duplicates with the media object are described herein. One or more devices implementing the techniques are configured to identify the near-duplicates based at least on similarity attributes included in the media object. Metadata is then extracted from the near-duplicates and is associated with the media object as descriptors of the media object to enable discovery of the media object based on the descriptors.
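
As an illustration only, the near-duplicate identification summarized above can be sketched with the "visual words" and inverted index that the claims recite. The codebook values, the two-dimensional features, the function names, and the scoring rule (share of the query's visual words found in a candidate) are all assumptions made for this sketch; the record does not specify them.

```python
from collections import defaultdict
from math import dist

# Hypothetical codebook: each centroid index stands for one "visual word".
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(feature):
    """Vector-quantize one feature to the index of its nearest centroid."""
    return min(range(len(CODEBOOK)), key=lambda i: dist(feature, CODEBOOK[i]))


def visual_words(features):
    """Return the set of visual words for one media object."""
    return {quantize(f) for f in features}


def build_inverted_index(objects):
    """Map each visual word to the ids of the objects that contain it."""
    index = defaultdict(set)
    for obj_id, features in objects.items():
        for word in visual_words(features):
            index[word].add(obj_id)
    return index


def near_duplicates(query_features, objects, threshold=0.5):
    """Return ids whose share of the query's visual words exceeds threshold."""
    query_words = visual_words(query_features)
    index = build_inverted_index(objects)
    hits = defaultdict(int)
    for word in query_words:
        for obj_id in index.get(word, ()):
            hits[obj_id] += 1
    return sorted(o for o, n in hits.items() if n / len(query_words) > threshold)


# Toy data: three candidate objects described by 2-D feature vectors.
objects = {
    "a": [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9)],
    "b": [(0.9, 0.9)],
    "c": [(0.0, 0.1), (1.0, 0.1)],
}
query = [(0.05, 0.0), (0.95, 0.05), (0.1, 1.0)]
print(near_duplicates(query, objects, threshold=0.5))
```

In this toy data, objects "a" and "c" share more than half of the query's visual words, while "b" shares none. A real system would use a learned codebook over high-dimensional descriptors, which is outside the scope of this sketch.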

90 Citations

19 Claims

1. A method comprising:
- retrieving a plurality of media objects responsive to a query media object presented to a search engine;
  
  extracting first visual words from the query media object, at least one of the first visual words being a vector quantization of a visual feature extracted from a media object;
  
  generating an inverted index mapping a plurality of visual words corresponding to individual media objects of the plurality of media objects;
  
  identifying near-duplicate media objects from the plurality of media objects based at least on analyzing the first visual words with respect to the inverted index and retrieving the individual media objects having at least one of the plurality of visual words with similarities to the first visual words greater than a predetermined threshold;
  
  extracting metadata from the near-duplicate media objects to form extracted metadata;
  
  storing the extracted metadata in a datastore as a set of metadata;
  
  increasing the set of metadata in the datastore based, at least in part, on a synonym dictionary;
  
  mining the set of metadata in the datastore to produce consolidated extracted metadata, wherein the mining the set of metadata includes utilizing a globalization data store, which maps terms from a first language to analogous terms in a second language;
  
  evaluating the consolidated extracted metadata to determine one or more metadata items that are common among the near-duplicate media objects; and
  
  associating the one or more metadata items that are common among the near-duplicate media objects with the query media object as one or more descriptors of the query media object to enable discovery of the query media object based on the one or more descriptors.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. The method of claim 1, wherein the query media object is selected from the following:
    - a still image, a video file and an audio file.
  - 3. The method of claim 1, wherein the identifying the near-duplicate media objects includes utilizing a previously prepared index of near-duplicates.
  - 4. The method of claim 1, wherein the extracting metadata further comprises one or more of parsing a filename, extracting metatags, extracting surrounding text, extracting annotations, or extracting commentary.
  - 5. The method of claim 1, wherein the extracting metadata further comprises:
    - applying a first metadata extraction technique to extract first metadata, applying a second metadata extraction technique to extract second metadata, reconciling the first metadata and the second metadata into identified metadata suitable for the mining the extracted metadata to determine the one or more metadata items that are common among the near-duplicate media objects.
  - 6. The method of claim 5, wherein the mining the extracted metadata further comprises at least one of search result clustering or majority voting.
  - 7. The method of claim 5, wherein the mining comprises:
    - applying a first key term mining technique to mine a first key term set comprising at least one key term, applying a second key term mining technique to mine a second key term set comprising at least one key term, and reconciling the first key term set and the second key term set into the one or more metadata items suitable for associating with the query media object as descriptors.
  - 8. The method of claim 5, wherein the mining the extracted metadata further includes utilizing an ontology.
  - 9. The method of claim 5, wherein either the identifying metadata or the mining includes utilizing a machine learning module comprising:
    - at least one learning routine, at least one rule generated from the at least one learning routine, and a rules engine.
  - 10. The method of claim 1, further comprising:
    - receiving a query, the query comprising an identifier for the query media object; and
      
      extracting the similarity attributes of the query media object to enable the identifying.
  - 11. The method of claim 1, further comprising:
    - receiving a query, the query comprising one or more key terms;
      
      extracting one or more key terms;
      
      retrieving the query media object based at least on the one or more key terms; and
      
      extracting the similarity attributes of the query media object to enable the identifying.
  - 12. The method of claim 11, wherein the extracting the one or more key terms includes utilizing a parser and a grammar.
  - 13. The method of claim 1, wherein the method is performed during an on-line, interactive session.
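
The metadata steps of claim 1 (increasing the set with a synonym dictionary, mining with a globalization data store, then finding items common among the near-duplicates) can be sketched as follows. This is a minimal sketch, not the patented implementation: the two lookup tables, the function names, and the use of simple majority voting (one of the options named in claim 6) with a `min_share` cutoff are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical lookup tables standing in for the synonym dictionary and
# the globalization data store named in claim 1.
SYNONYMS = {"puppy": {"dog"}, "automobile": {"car"}}
GLOBALIZATION = {"perro": "dog", "chien": "dog", "coche": "car"}


def consolidate(terms):
    """Expand one object's terms with synonyms, then map them to one language."""
    expanded = {t.lower() for t in terms}
    for term in list(expanded):
        expanded |= SYNONYMS.get(term, set())
    return {GLOBALIZATION.get(t, t) for t in expanded}


def common_terms(metadata_per_object, min_share=0.5):
    """Majority vote: keep terms found in more than min_share of the objects."""
    votes = Counter()
    for terms in metadata_per_object:
        votes.update(consolidate(terms))
    total = len(metadata_per_object)
    return sorted(t for t, n in votes.items() if n / total > min_share)


# Metadata extracted from three hypothetical near-duplicates.
extracted = [["Puppy", "park"], ["perro", "grass"], ["dog", "park"]]
print(common_terms(extracted))
```

With this toy input, "dog" is supported by all three objects (directly, through a synonym, and through translation) and "park" by two of three, so both clear the assumed 0.5 cutoff, while "grass" and "puppy" do not.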

14. A computer-implemented method comprising:
- retrieving a first media object from a first location specified by a location specifier comprising one or more locations of media objects;
  
  extracting first visual words from the first media object, at least one of the first visual words being a vector quantization of a visual feature extracted from a media object;
  
  storing first visual words from the first media object;
  
  determining that the first visual words indicate the first media object is a near-duplicate of a second media object and a third media object stored at a second location specified by the location specifier based in part on analyzing the first visual words of the first media object with respect to second visual words of the second media object and third visual words of the third media object, the second visual words and the third visual words having similarities to the first visual words greater than a predetermined threshold;
  
  storing metadata associated with the second media object and the third media object in a datastore as a set of metadata;
  
  increasing the set of metadata based, at least in part, on a synonym dictionary; and
  
  in response to determining that the first media object is a near-duplicate of the second media object and the third media object;
  
  mining the set of metadata to produce consolidated metadata, wherein the mining the set of metadata includes utilizing a globalization data store, which maps terms from a first language to analogous terms in a second language;
  
  evaluating the consolidated metadata to determine one or more key terms that are common to both the second media object and the third media object; and
  
  associating the one or more key terms that are common to both the second media object and the third media object with the first media object.
- View Dependent Claims (15)
- - 15. The method of claim 14, wherein the location specifier is a list of fully qualified paths of multimedia files.
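
Claim 14 compares the first media object's visual words against those of a second and a third media object and, when both comparisons exceed a threshold, associates the key terms common to both with the first object. The sketch below assumes Jaccard similarity over visual-word sets as the similarity measure and a 0.6 default threshold; neither is stated in the claim, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two visual-word sets (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def key_terms_for(first_words, second, third, threshold=0.6):
    """Return key terms shared by both near-duplicates, or an empty set.

    `second` and `third` are (visual_words, key_terms) pairs. Terms are
    returned only when both objects exceed the similarity threshold.
    """
    both_similar = all(
        jaccard(first_words, words) > threshold for words, _ in (second, third)
    )
    if both_similar:
        return set(second[1]) & set(third[1])
    return set()


first = {1, 2, 3, 4}
second = ({1, 2, 3, 4, 5}, {"eiffel tower", "paris", "night"})
third = ({1, 2, 3}, {"paris", "eiffel tower", "travel"})
print(sorted(key_terms_for(first, second, third)))
```

Here the similarities are 0.8 and 0.75, so at the assumed threshold both objects qualify and the shared terms "eiffel tower" and "paris" would be associated with the first object.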

16. A computer system comprising a processor and memory to store computer-executable instructions that, when executed by the processor, perform operations including:
- retrieving a plurality of media objects responsive to a query media object presented to a search engine;
  
  extracting first visual words from the query media object, at least one of the first visual words being a vector quantization of a visual feature extracted from a media object;
  
  identifying near-duplicate media objects from the plurality of media objects based at least on analyzing the first visual words with respect to a plurality of visual words corresponding to individual media objects of the plurality of media objects, the near-duplicate media objects having at least one of the plurality of visual words with similarities to the first visual words greater than a predetermined threshold;
  
  storing metadata associated with the media objects in a datastore as a set of metadata;
  
  increasing the set of metadata based, at least in part, on a synonym dictionary;
  
  mining the set of metadata associated with the near-duplicate media objects to produce consolidated metadata, wherein the mining the set of metadata includes utilizing a globalization data store, which maps terms from a first language to analogous terms in a second language;
  
  evaluating the consolidated metadata to determine one or more key terms that are common among the near-duplicate media objects, the one or more key terms previously stored in a key term data store; and
  
  associating the one or more key terms that are common among the near-duplicate media objects with the query media object as one or more descriptors of the query media object to enable discovery of the query media object based on the descriptors.
- View Dependent Claims (17, 18, 19)
- - 17. The computer system of claim 16, wherein the operations further comprise storing the associations of the one or more key terms to the query media object in an associations datastore.
  - 18. The computer system of claim 17, wherein the operations further comprise indexing the associations datastore.
  - 19. The computer system of claim 16, wherein the operations further comprise:
    - receiving feedback indicating that a key term is incorrectly associated with the query media object; and
      
      disassociating the key term that was incorrectly associated from the query media object.
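
Claims 17 to 19 add an associations datastore, an index over it, and feedback-driven disassociation. One way to picture these together is a small in-memory store that keeps a forward map (object to key terms) and a reverse index (key term to objects). The class and method names are hypothetical, and a production system would presumably use a persistent store rather than dictionaries.

```python
from collections import defaultdict


class AssociationStore:
    """In-memory stand-in for an associations datastore (hypothetical)."""

    def __init__(self):
        self._terms_by_object = defaultdict(set)  # object id -> key terms
        self._objects_by_term = defaultdict(set)  # index: key term -> object ids

    def associate(self, object_id, terms):
        """Record key terms as descriptors of a media object."""
        for term in terms:
            self._terms_by_object[object_id].add(term)
            self._objects_by_term[term].add(object_id)

    def disassociate(self, object_id, term):
        """Apply feedback that a key term was incorrectly associated."""
        self._terms_by_object[object_id].discard(term)
        self._objects_by_term[term].discard(object_id)

    def search(self, term):
        """Discover media objects by descriptor using the reverse index."""
        return sorted(self._objects_by_term.get(term, ()))

    def descriptors(self, object_id):
        """List the key terms currently associated with an object."""
        return sorted(self._terms_by_object.get(object_id, ()))


store = AssociationStore()
store.associate("img-1", ["dog", "cat"])
store.disassociate("img-1", "cat")  # feedback: "cat" was wrong
print(store.descriptors("img-1"), store.search("cat"))
```

Keeping both maps in step means a disassociation removes the term from the object's descriptors and from search results in one call.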

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Wang, Xin-Jing; Li, Yi; Liu, Ming; Zhang, Lei; Ma, Wei-Ying
Primary Examiner(s)
GEBRESENBET, DINKU W

Application Number

US12/790,772
Publication Number

US 20110295775A1
Time in Patent Office

2,601 Days
Field of Search

707/708
US Class Current
CPC Class Codes

G06F 16/48 Retrieval characterised by ...

G06F 16/58 Retrieval characterised by ...
