Similarity detection and clustering of images
First Claim
1. A computer system, comprising:
a communication network;
a client connected to the communication network, the client having an input/output interface to submit a query;
a server connected to the communication network, the server having an input/output interface to receive a query from the client;
a database of images, wherein the database is resident in the server;
a search module resident in the server, wherein the search module is configured to:
search the database utilizing the query to locate a set of images that match the search query;
analyze each image included in the set of images to determine whether any images included in the set are near duplicates of one another, wherein the analysis includes a pre-processing of the images included in the set of images, the pre-processing including:
gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
scaling each image in the set of images to a uniform size; and
computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
determine the number of near duplicate images included in the set of images for each image in the set of images;
determine a popularity level of a particular image in the set of images based on the number of near duplicate images for the particular image found in the set of images;
rank the images included in the set of images according to their determined popularity level with a higher popularity level placed higher in the search result; and
provide the ranked list of images to the client in response to the query.
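
The counting and ranking steps of the first claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `near_duplicate` is a hypothetical predicate standing in for whatever signature comparison the search module uses, and the popularity level is assumed to equal the raw near-duplicate count.

```python
from typing import Callable, Dict, List, Sequence


def rank_by_popularity(
    image_ids: Sequence[str],
    near_duplicate: Callable[[str, str], bool],
) -> List[str]:
    """Return image ids ordered so that images with more near duplicates come first."""
    # For each image, count how many other images in the result set
    # are near duplicates of it.
    counts: Dict[str, int] = {
        a: sum(1 for b in image_ids if b != a and near_duplicate(a, b))
        for a in image_ids
    }
    # Assumption: popularity level == near-duplicate count.
    # sorted() is stable, so tied images keep their original search order.
    return sorted(image_ids, key=lambda image_id: counts[image_id], reverse=True)
```

The pairwise count is quadratic in the size of the result set, which is tolerable for a single query's matches but would need an index or bucketing scheme for a whole collection.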
2 Assignments
0 Petitions
Abstract
A system and method for determining if a set of images in a large collection of images are near duplicates allows for improved management and retrieval of images. Images are processed, image signatures are generated for each image in the set of images, and the generated image signatures are compared. Detecting similarity between images can be used to cluster and rank images.
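
The abstract does not say how an image signature is built or compared. As a rough sketch only, the example below swaps in a simple average-hash style signature (one bit per luminance cell, set when the cell is brighter than the matrix mean) and compares signatures by the fraction of matching bits. The function names, the bit-based signature, and the 0.9 threshold are all assumptions made for illustration.

```python
from typing import List


def image_signature(luminance: List[List[float]]) -> List[int]:
    """Assumed signature: 1 where a cell is brighter than the matrix mean, else 0."""
    cells = [value for row in luminance for value in row]
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions at which two equal-length signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    return 1.0 - differing / len(sig_a)


def are_near_duplicates(sig_a: List[int], sig_b: List[int], threshold: float = 0.9) -> bool:
    """Assumed decision rule: near duplicates when similarity reaches the threshold."""
    return similarity(sig_a, sig_b) >= threshold
```

Because every image is first scaled to the same size, all signatures have the same length and can be compared position by position.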
169 Citations
17 Claims
1. A computer system, comprising: (recited in full under First Claim above)
Dependent claims: 2
3. A method for determining near duplicate images in a set of images, the method comprising:
pre-processing the set of images, wherein the pre-processing includes:
gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
scaling each image in the set of images to a uniform size; and
computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
generating an image signature for each image in the set of images based on the pre-processing of the images;
comparing the generated image signatures to generate an indication of similarity of images in the set of images;
determining, based on the indication of similarity, whether two or more images in the set of images are near duplicates of one another; and
associating an anchor corresponding to a first image to a second image determined to be a near duplicate of the first image, wherein the anchor is a word in text surrounding the first image.
Dependent claims: 4, 5, 6, 7, 8, 9, 10, 11, 12
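
The three pre-processing steps recited in the independent claims (gathering statistics, scaling to a uniform size, computing a luminance matrix) can be sketched in plain Python. The claims do not fix the details, so the following are assumptions: images are lists of rows of 8-bit (R, G, B) tuples, scaling is nearest-neighbour to an 8 by 8 grid, pixel values are linearized with the sRGB transfer function, and the weighted sum uses the Rec. 709 luminance coefficients.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]   # 8-bit (R, G, B)
Image = List[List[Pixel]]      # rows of pixels


def gather_statistics(img: Image) -> Dict[str, object]:
    """Aspect ratio (width / height) and the mean of each RGB channel."""
    height, width = len(img), len(img[0])
    n = height * width
    means = tuple(sum(px[c] for row in img for px in row) / n for c in range(3))
    return {"aspect_ratio": width / height, "channel_means": means}


def scale_to_uniform(img: Image, size: int = 8) -> Image:
    """Nearest-neighbour resample to size x size (the target size is an assumption)."""
    height, width = len(img), len(img[0])
    return [
        [img[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def srgb_to_linear(value: int) -> float:
    """Convert an 8-bit sRGB component to linear light in [0, 1]."""
    c = value / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def luminance_matrix(img: Image) -> List[List[float]]:
    """Weighted sum of linear R, G, B per pixel (Rec. 709 weights, assumed)."""
    wr, wg, wb = 0.2126, 0.7152, 0.0722
    return [
        [wr * srgb_to_linear(r) + wg * srgb_to_linear(g) + wb * srgb_to_linear(b)
         for (r, g, b) in row]
        for row in img
    ]
```

Statistics are gathered before scaling because the aspect ratio is lost once every image is forced to the same square size.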
13. A method for clustering a set of images, the method comprising:
pre-processing the set of images, wherein the pre-processing includes:
gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
scaling each image in the set of images to a uniform size; and
computing a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
generating an image signature for each image, based on the pre-processing;
comparing the image signatures to determine an indicator of similarity between the images, wherein the indicator of similarity is used to cluster images;
defining an image directed graph having the set of images as a set of vertices and the indicator of similarity as edges between the vertices, wherein the edges are annotated with a weight that represents a level of similarity between images; and
linking the image directed graph with one or more layers of a graph comprising multiple layers, the layers including a web page directed graph having a set of web pages as a set of vertices and hyperlinks between the web pages as edges between the vertices, a click-through web page directed graph having a set of web pages selected in response to a query as a set of vertices and a selection of a subset from the set of web pages selected in response to the query as edges between the vertices, and a click-through image directed graph having a set of images selected in response to a query as a set of vertices and a selection of a subset from the set of images selected in response to the query as edges between the vertices.
Dependent claims: 14
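
One way to picture the image directed graph and its link to other layers is sketched below. The claim does not say how layers are linked, so the `LayeredGraph` container, its explicit cross-layer links, and the 0.5 edge threshold are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]   # vertex -> {successor: edge weight}
Node = Tuple[str, str]                # (layer name, vertex id)


def build_image_graph(
    similarities: Dict[Tuple[str, str], float],
    threshold: float = 0.5,
) -> Graph:
    """Images become vertices; sufficiently similar pairs get a weighted directed edge."""
    graph: Graph = {}
    for (src, dst), weight in similarities.items():
        graph.setdefault(src, {})
        graph.setdefault(dst, {})
        if weight >= threshold:       # assumed cut-off for drawing an edge
            graph[src][dst] = weight  # weight = level of similarity
    return graph


@dataclass
class LayeredGraph:
    """Hypothetical multi-layer graph: named layers plus explicit cross-layer links."""
    layers: Dict[str, Graph] = field(default_factory=dict)
    cross_links: List[Tuple[Node, Node]] = field(default_factory=list)

    def add_layer(self, name: str, graph: Graph) -> None:
        self.layers[name] = graph

    def link(self, src: Node, dst: Node) -> None:
        # Both endpoints must already exist in their respective layers.
        for layer, vertex in (src, dst):
            if vertex not in self.layers[layer]:
                raise KeyError(f"{vertex!r} is not a vertex of layer {layer!r}")
        self.cross_links.append((src, dst))
```

A web page layer or a click-through layer would be added with `add_layer` in the same way as the image layer, each as its own adjacency dictionary.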
15. A server implementing in hardware a search engine for determining whether two images are near duplicates, the search engine comprising:
a processor configured to pre-process the set of images, wherein the pre-processing includes gathering statistics for each image in the set of images, wherein the statistics include at least one of an aspect ratio associated with an image and a mean value for each chroma channel in the red-green-blue color space associated with an image;
an image scaler configured to scale each image in the set of images to a uniform size;
a luminance matrix processor configured to compute a luminance matrix for each scaled image in the set of images, wherein the luminance matrix includes a weighted sum of linear red-green-blue color components of an image;
an image signature generator configured to generate an image signature for each image; and
a comparison facility configured to determine whether the two images are near duplicate images based on their image signatures, the search engine associating an anchor corresponding to the first image to the second image when determined to be a near duplicate of the first image, wherein the anchor is a word in text surrounding the image.
Dependent claims: 16, 17
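
The anchor association recited in claims 3 and 15 amounts to copying the words found around one image onto its near duplicates. A minimal sketch, assuming anchors are stored as a set of words per image id and that near duplicates arrive as ordered (first, second) pairs; both the data layout and the function name `propagate_anchors` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def propagate_anchors(
    anchors: Dict[str, Set[str]],
    near_duplicate_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return a new mapping in which each second image also carries the first image's anchors."""
    # Copy so that the caller's mapping is left untouched.
    result = {image_id: set(words) for image_id, words in anchors.items()}
    for first, second in near_duplicate_pairs:
        # Read from the original anchors so propagation stays one hop deep.
        result.setdefault(second, set()).update(anchors.get(first, set()))
    return result
```

With this in place, an image that appears on a page with no descriptive text can still be matched by a text query if one of its near duplicates has useful surrounding words.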
Specification