Systems and methods for clustering of near-duplicate images in very large image collections
Abstract
Detection of near-duplicate images is important for detecting the reuse of copyrighted material. Some applications require the clustering of near-duplicates instead of the comparison to an original. Representing images as bags of visual words is the first step of our clustering approach. An inverted index points from each visual word to all the images containing that visual word. In the next step, matches are geometrically verified in pairs of images that share a large fraction of their visual words. Geometric verification may use affine, perspective, or other transformations. The verification step provides a similarity measure based on the fraction of the matching image points and on their distributions in the compared images. The resulting distance matrix is very sparse because most images in the collection are not compared to each other. This distance matrix is used as input for a modified agglomerative hierarchical clustering approach that can handle a sparse distance matrix.
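The inverted index and the collection of candidate images described in the abstract might be sketched as follows. This is only an illustrative sketch, not the patented implementation: it assumes each image has already been reduced to a list of integer visual word IDs, and the names `build_inverted_index`, `shared_word_counts`, and the toy `image_words` data are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(image_words):
    """Map each visual word ID to the set of image IDs that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in set(words):
            index[word].add(image_id)
    return index


def shared_word_counts(selected, image_words, index):
    """For every other image, count the distinct visual words shared with `selected`."""
    counts = Counter()
    for word in set(image_words[selected]):
        for other in index[word]:
            if other != selected:
                counts[other] += 1
    return counts


# Hypothetical toy collection: image ID -> visual word IDs found in that image.
image_words = {
    "a": [1, 2, 3, 4],
    "b": [2, 3, 4, 9],
    "c": [7, 8],
    "d": [4, 7],
}

index = build_inverted_index(image_words)
counts = shared_word_counts("a", image_words, index)
print(dict(counts))
```

Only images that appear in `counts` would be passed on to geometric verification, which is why most pairs in a large collection are never compared and the distance matrix stays sparse.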
24 Claims
1. A computer-implemented method for clustering a plurality of images, the computer-implemented method being performed in connection with a computerized system comprising a central processing unit and a memory, the computer-implemented method comprising:
a. generating a vocabulary of visual words in the plurality of images;
b. extracting image features for image key points for each of the plurality of images;
c. based on the extracted image features, creating an index pointing from the visual words in the vocabulary to images from the plurality of images, which contain these visual words;
d. using the created index to collect all other images of the plurality of images that share at least one visual word with a selected image and determining a number of shared visual words;
e. performing a geometric verification to verify whether the shared visual words are located at same locations in the selected image and the other images of the plurality of images and taking a fraction of verified shared visual words to all shared visual words as a similarity measure; and
f. clustering the plurality of images hierarchically based on the similarity measure.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
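The similarity measure of step e, the fraction of verified shared visual words among all shared visual words, could be illustrated as below. The sketch deliberately swaps in a translation-only model (a median displacement) for the affine or perspective transformations mentioned in the abstract, and it assumes one key point per visual word per image. The function name `verified_fraction`, the `tolerance` parameter, and the sample coordinates are hypothetical.

```python
from statistics import median


def verified_fraction(points_a, points_b, tolerance=5.0):
    """Return verified shared words / all shared words for two images.

    points_a, points_b: dict mapping visual word ID -> (x, y) key point.
    Assumption: a pure translation stands in for the affine or
    perspective models; it is estimated as the median displacement.
    """
    shared = sorted(set(points_a) & set(points_b))
    if not shared:
        return 0.0

    # Estimate the translation from the median displacement of shared words.
    dx = median(points_b[w][0] - points_a[w][0] for w in shared)
    dy = median(points_b[w][1] - points_a[w][1] for w in shared)

    # A shared word is "verified" if it lands near its predicted location.
    verified = 0
    for w in shared:
        err_x = points_b[w][0] - points_a[w][0] - dx
        err_y = points_b[w][1] - points_a[w][1] - dy
        if (err_x ** 2 + err_y ** 2) ** 0.5 <= tolerance:
            verified += 1
    return verified / len(shared)


# Hypothetical key points: word 4 matches by appearance but not by location.
points_a = {1: (10, 10), 2: (50, 20), 3: (30, 80), 4: (70, 70)}
points_b = {1: (15, 12), 2: (55, 22), 3: (35, 82), 4: (10, 10)}

similarity = verified_fraction(points_a, points_b)
distance = 1.0 - similarity
print(similarity, distance)
```

Converting the similarity to a distance as `1 - similarity` is one possible convention for feeding the result into a clustering step; the claim itself only defines the similarity.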
15. A computerized system for clustering a plurality of images, the computerized system comprising a central processing unit and a memory storing a set of computer-executable instructions for:
a. generating a vocabulary of visual words in the plurality of images;
b. extracting image features for image key points for each of the plurality of images;
c. based on the extracted image features, creating an index pointing from the visual words in the vocabulary to images from the plurality of images, which contain these visual words;
d. using the created index to collect all other images of the plurality of images that share at least one visual word with a selected image and determining a number of shared visual words;
e. performing a geometric verification to verify whether the shared visual words are located at same locations in the selected image and the other images of the plurality of images and taking a fraction of verified shared visual words to all shared visual words as a similarity measure; and
f. clustering the plurality of images hierarchically based on the similarity measure.
Dependent claims: 16, 17, 18, 19
20. A non-transitory computer-readable medium embodying a set of computer-executable instructions, which, when executed in a computerized system comprising a central processing unit and a memory, cause the computerized system to perform a method for clustering a plurality of images, the method comprising:
a. generating a vocabulary of visual words in the plurality of images;
b. extracting image features for image key points for each of the plurality of images;
c. based on the extracted image features, creating an index pointing from the visual words in the vocabulary to images from the plurality of images, which contain these visual words;
d. using the created index to collect all other images of the plurality of images that share at least one visual word with a selected image and determining a number of shared visual words;
e. performing a geometric verification to verify whether the shared visual words are located at same locations in the selected image and the other images of the plurality of images and taking a fraction of verified shared visual words to all shared visual words as a similarity measure; and
f. clustering the plurality of images hierarchically based on the similarity measure.
21. A computer-implemented method for clustering a plurality of content items, the computer-implemented method being performed in connection with a computerized system comprising a central processing unit and a memory, the computer-implemented method comprising:
a. generating a vocabulary of words in the plurality of content items;
b. extracting features from the plurality of content items;
c. based on the extracted features, creating an index pointing from the words in the vocabulary to content items from the plurality of content items, which contain these words;
d. using the created index to collect all other content items of the plurality of content items that share at least one word with a selected content item and determining a number of shared words;
e. performing a content verification to verify whether the shared words are located at same locations in the selected content item and the other content items of the plurality of content items and taking a fraction of verified shared words to all shared words as a similarity measure; and
f. clustering the plurality of content items hierarchically based on the similarity measure.
Dependent claims: 22, 23, 24
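Step f of the independent claims, combined with the sparse distance matrix noted in the abstract, might look roughly like the following. This section does not specify how the agglomerative clustering is modified, so the sketch assumes single linkage, treats every pair missing from the sparse matrix as infinitely distant, and stops merging at a hypothetical `cutoff`. The function `cluster_sparse` and the sample distances are illustrative only.

```python
def cluster_sparse(distances, items, cutoff):
    """Single-linkage agglomerative clustering over a sparse distance dict.

    distances: dict mapping (item_a, item_b) -> distance, only for pairs
    that were actually compared; absent pairs are never merged directly.
    Returns the clusters at `cutoff` and the ordered merge history.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    # Visit compared pairs from closest to farthest, as agglomeration does.
    for (a, b), d in sorted(distances.items(), key=lambda kv: kv[1]):
        if d > cutoff:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
            merges.append((a, b, d))

    groups = {}
    for item in items:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values()), merges


# Hypothetical sparse distances, e.g. 1 - similarity for verified pairs.
items = ["a", "b", "c", "d", "e"]
distances = {
    ("a", "b"): 0.1,
    ("b", "c"): 0.2,
    ("d", "e"): 0.15,
    ("c", "d"): 0.9,
}

clusters, merges = cluster_sparse(distances, items, cutoff=0.5)
print(clusters)
```

The merge history records the order and distance of each merge, so it can serve as a simple dendrogram from which clusters at other cutoffs could be read off.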
Specification