Similarity detection and clustering of images

US 7,801,893 B2
Filed: 09/30/2005
Issued: 09/21/2010
Est. Priority Date: 09/30/2005
Status: Expired due to Fees

2 Assignments

0 Petitions

Abstract

A system and method for determining if a set of images in a large collection of images are near duplicates allows for improved management and retrieval of images. Images are processed, image signatures are generated for each image in the set of images, and the generated image signatures are compared. Detecting similarity between images can be used to cluster and rank images.

169 Citations

17 Claims

1. A computer system, comprising:
- a communication network;
  
  a client connected to the communication network, the client having an input/output interface to submit a query;
  
  a server connected to the communication network, the server having an input/output interface to receive a query from the client;
  
  a database of images, wherein the data base is resident in the server;
  
  a search module resident in the server, wherein the search module is configured to;
  
  search the database utilizing the query to locate a set of images that match the search query;
  
  analyze each image included in the set of images to determine whether any images included in the set are near duplicates of one another, wherein the analysis includes a pre-processing of the images included in the set of images, the pre-processing including;
  
  gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
  
  scaling each image in the set of images to a uniform size; and
  
  computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of image;
  
  determine the number of near duplicate images included in the set of images for each image in the set of images;
  
  determine a popularity level of a particular image in the set of images based on the number of near duplicate images for the particular image found in the set of images;
  
  rank the images included in the set of images according to their determined popularity level with a higher popularity level placed higher in the search result; and
  
  providing the ranked list of images to the client in response to the query.
- - 2. The system recited in claim 1, wherein the search engine is configured to cluster images determined as being near duplicates of each other.
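
The popularity ranking in claim 1 can be illustrated with a minimal sketch. It assumes near-duplicate detection has already produced a list of image pairs; the function name `rank_by_popularity` and the pair-list input format are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def rank_by_popularity(image_ids, near_duplicate_pairs):
    """Rank images by how many near duplicates each has in the result set.

    image_ids: image identifiers in their original search-result order.
    near_duplicate_pairs: (id_a, id_b) pairs already judged near duplicates
    (assumed input; the detection step itself is not shown here).
    """
    neighbours = defaultdict(set)
    for a, b in near_duplicate_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Popularity level = number of near duplicates found for the image.
    popularity = {img: len(neighbours[img]) for img in image_ids}

    # sorted() is stable, so images with equal popularity keep their
    # original relative order.
    ranked = sorted(image_ids, key=lambda img: popularity[img], reverse=True)
    return ranked, popularity


ranked, popularity = rank_by_popularity(
    ["a", "b", "c", "d"], [("b", "c"), ("b", "d")]
)
print(ranked)
```

Using the raw duplicate count as the popularity level is the simplest reading of the claim; a real system might normalize or combine it with other ranking signals.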

3. A method for determining near duplicate images in a set of images, the method comprising:
- pre-processing the set of images, wherein the pre-processing includes;
  
  gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
  
  scaling each image in the set of images to a uniform size; and
  
  computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of image;
  
  generating an image signature for each image in the set of images based on the pre-processing of the images;
  
  comparing the generated image signatures to generate an indication of similarity of images in the set of images;
  
  determining, based on the indication of similarity, whether two or more images in the set of images are near duplicates of one another; and
  
  associating an anchor corresponding to a first image to a second image determined to be a near duplicate of the first image, wherein the anchor is a word in text surrounding the first image.
- - 4. The method recited in claim 3, wherein the set of images is at least a subset of images obtained by a search engine upon searching for an image search term.
  - 5. The method recited in claim 3, wherein generating an image signature for each images in the set of images comprises:
    - creating a signature for each image in the set of images; and
      
      reducing the size of the signature.
  - 6. The method recited in claim 3, wherein comparing the generated image signatures comprises:
    - determining distances between the generated image signatures.
  - 7. The method recited in claim 3, further comprising:
    - reducing the number of comparisons required for detecting near duplicate images within the set of images.
  - 8. The method recited in claim 7, wherein reducing the number of comparisons required for detecting near duplicate images within the set of images comprises:
    - sorting the images according to aspect ratio;
      
      defining a window of comparison over the sorted images; and
      
      reducing a number of comparisons within the window of comparisons.
  - 9. The method recited in claim 8, wherein reducing the number of comparisons required for detecting near duplicate images within the set of images located off-line comprises:
    - building clusters of images that are determined to be near duplicates;
      
      selecting a representative of each cluster of images; and
      
      merging two clusters of images when their respective representative are determined to be near duplicate images.
  - 10. The method recited in claim 3, wherein reducing the number of comparisons required for detecting near duplicate images within the set of images comprises:
    - determining two images in the set of images to be near duplicates of each other when the two images are near duplicates of a third image in the set of images.
  - 11. The method recited in claim 3, further comprising:
    - associating an attribute corresponding to a first image to a second image determined to be a near duplicate of the first image.
  - 12. The method recited in claim 3, further comprising:
    - determining a popularity level of an image in the set of images based on the number of near duplicate images in the set of images.
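
The comparison-reduction ideas in claims 6, 8, and 10 can be sketched together. This is an illustration under stated assumptions, not the patented implementation: signatures are plain lists of floats, distance is Euclidean, and the `threshold` and `window` parameters, like all function names, are hypothetical. A union-find structure stands in for the transitivity rule of claim 10.

```python
import math


def signature_distance(sig_a, sig_b):
    """Euclidean distance between two equal-length signatures (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)))


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_near_duplicates(images, threshold, window):
    """images: list of (image_id, aspect_ratio, signature) tuples."""
    # Sort by aspect ratio so likely duplicates end up close together.
    ordered = sorted(images, key=lambda item: item[1])
    parent = {image_id: image_id for image_id, _, _ in ordered}
    comparisons = 0

    for i, (id_a, _, sig_a) in enumerate(ordered):
        # Only compare against the next `window` images in sorted order.
        for id_b, _, sig_b in ordered[i + 1 : i + 1 + window]:
            root_a, root_b = find(parent, id_a), find(parent, id_b)
            if root_a == root_b:
                continue  # already linked through a third image
            comparisons += 1
            if signature_distance(sig_a, sig_b) <= threshold:
                parent[root_b] = root_a

    clusters = {}
    for image_id in parent:
        clusters.setdefault(find(parent, image_id), []).append(image_id)
    return list(clusters.values()), comparisons


images = [
    ("a", 1.00, [0.0, 0.0]),
    ("b", 1.01, [0.1, 0.0]),
    ("c", 1.02, [0.2, 0.0]),
    ("d", 1.50, [5.0, 5.0]),
]
clusters, comparisons = cluster_near_duplicates(images, threshold=0.15, window=2)
print(clusters, comparisons)
```

In this example `a` and `c` land in the same cluster even though their direct distance exceeds the threshold, because both are near duplicates of `b`.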

13. A method for clustering a set of images, the method comprising:
- pre-processing the set of images, wherein the pre-processing includes;
  
  gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
  
  scaling each image in the set of images to a uniform size; and
  
  computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
  
  generating an image signature for each image, based on the pre-processing;
  
  comparing the image signatures to determine an indicator of similarity between the images, wherein the indicator of similarity is used to cluster images;
  
  defining an image directed graph having the set of images as a set of vertices and the indicator of similarity as edges between the vertices, wherein the edges are annotated with a weight that represents a level of similarity between images; and
  
  linking the image directed graph with one or more layers of a graph comprising multiple layers, the layers including a web page directed graph having a set of web pages as a set of vertices and hyperlinks between the web pages as edges between the vertices, a click-through web page directed graph having a set of web pages selected in response to a query as a set of vertices and a selection of a subset from the set of web pages selected in response to the query as edges between the vertices, and a click-through image directed graph having a set of images selected in response to a query as a set of vertices and a selection of a subset from the set of images selected in response to the query as edges between the vertices.
- - 14. The method recited in claim 13, wherein linking the image directed graph with one or more layers of the graph comprising multiple layers comprises:
    - determining edges between two images in the image graph based on the edges between web pages comprising the two images in any one of the web page directed graph, the click-through web page directed graph, and the click-through image directed graph.
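
One way to picture the image directed graph of claim 13 and the edge derivation of claim 14 is with nested dictionaries. The sketch below is an assumption-laden illustration: the adjacency-dict representation, the function names, and the fixed `link_weight` given to edges derived from hyperlinks are all hypothetical, and the claims do not specify how such weights are chosen. Only the web page layer is shown; the click-through layers would be handled the same way.

```python
def build_image_graph(similarities):
    """Build a weighted directed graph from {(img_a, img_b): weight}."""
    graph = {}
    for (src, dst), weight in similarities.items():
        graph.setdefault(src, {})[dst] = weight
        graph.setdefault(dst, {})  # make sure every image is a vertex
    return graph


def add_edges_from_page_links(graph, page_links, images_on_page, link_weight=0.1):
    """For each hyperlink page_a -> page_b, connect images on page_a to
    images on page_b. link_weight is a hypothetical placeholder value."""
    for page_a, page_b in page_links:
        for img_a in images_on_page.get(page_a, ()):
            for img_b in images_on_page.get(page_b, ()):
                if img_a == img_b:
                    continue
                edges = graph.setdefault(img_a, {})
                graph.setdefault(img_b, {})
                # Keep an existing similarity weight; add an edge only if missing.
                edges.setdefault(img_b, link_weight)
    return graph


graph = build_image_graph({("i1", "i2"): 0.9})
add_edges_from_page_links(
    graph,
    page_links=[("p1", "p2")],
    images_on_page={"p1": ["i1"], "p2": ["i2", "i3"]},
)
print(graph)
```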

15. A server implementing in hardware a search engine for determining whether two images are near duplicates, the search engine comprising:
- a processor configured to pre-process the set of images, wherein the pre-processing includes gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
  
  an image scalar configured to scale scaling each image in the set of images to a uniform size;
  
  a luminance matrix processor configured to compute a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
  
  an image signature generator configured to generate an image signature for each image; and
  
  a comparison facility configured to determine whether the two images are near duplicate images based on their image signatures, the search engine associating an anchor corresponding to the first image to the second image when determined to be a near duplicate of the first image, wherein the anchor is a word in text surrounding the image.
- - 16. The server recited in claim 15, wherein the image signature generator further comprises:
    - an image processor configured to compute an aspect ratio and a mean value for each primary color chroma channel.
  - 17. The server recited in claim 16, wherein the image signature generator further comprises:
    - a wavelet transformer configured to generate a wavelet signature for each image.
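
The pre-processing and signature components of claims 15 to 17 can be sketched in pure Python. Several details here are assumptions that the claims leave open: images are lists of rows of linear `(r, g, b)` tuples, scaling uses nearest-neighbour sampling, the luminance weights are the Rec. 709 coefficients, and the wavelet is a single-level Haar transform on an even-sized matrix. All function names are hypothetical.

```python
def image_statistics(pixels):
    """Aspect ratio and per-channel mean of an image given as rows of (r, g, b)."""
    height, width = len(pixels), len(pixels[0])
    count = height * width
    means = tuple(
        sum(px[c] for row in pixels for px in row) / count for c in range(3)
    )
    return {"aspect_ratio": width / height, "channel_means": means}


def scale_nearest(pixels, size):
    """Scale to a uniform size x size grid by nearest-neighbour sampling."""
    height, width = len(pixels), len(pixels[0])
    return [
        [pixels[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def luminance_matrix(pixels, weights=(0.2126, 0.7152, 0.0722)):
    """Weighted sum of linear RGB components (Rec. 709 weights assumed)."""
    return [[sum(w * c for w, c in zip(weights, px)) for px in row] for row in pixels]


def haar_step(values):
    """One level of the 1-D Haar transform: pairwise averages, then
    pairwise half-differences. Expects an even number of values."""
    pairs = list(zip(values[0::2], values[1::2]))
    return [(a + b) / 2 for a, b in pairs] + [(a - b) / 2 for a, b in pairs]


def haar_2d(matrix):
    """Single-level 2-D Haar transform: rows first, then columns."""
    rows = [haar_step(row) for row in matrix]
    cols = [haar_step(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


pixels = [[(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
          [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)]]
stats = image_statistics(pixels)
signature = haar_2d(luminance_matrix(scale_nearest(pixels, 2)))
print(stats, signature)
```

A production signature would typically keep only a subset of the wavelet coefficients, in the spirit of the signature size reduction in claim 5; that step is omitted here.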

Current Assignee: IAC Search & Media Incorporated (Match Group, Inc.)
Original Assignee: IAC Search & Media Incorporated (Match Group, Inc.)
Inventors: Liu, Xin; Choksi, Ankur; Tanganelli, Filippo; Carnevale, Luigi; Gulli', Antonino; Li, Beitao; Yang, Tao; Savona, Antonio
Primary Examiner(s): Jalil; Neveen Abel
Assistant Examiner(s): Hoang; Son T
Application Number: US11/242,390
Publication Number: US 20070078846A1
Time in Patent Office: 1,817 Days
Field of Search: None
US Class Current: 707/737
CPC Class Codes:
G06F 16/583   using metadata automaticall...
G06F 16/951   Indexing; Web crawling tech...
G06V 10/56   relating to colour
G06V 10/7515   Shifting the patterns to ac...
