Method for determining near duplicate data objects
First Claim
1. A method for determining that at least one data object B is a candidate for near duplicate to a data object A with a given similarity level th, comprising:
(i) providing from a storage at least two different functions on a data object, each function having a numeric function value;
(ii) determining by a processor that at least one data object B is a candidate for near duplicate to a data object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions, $|f_i(A) - f_i(B)| \le \delta_i(f, th)$, wherein $\delta_i$ is dependent upon at least $f$, th.
Abstract
A system for determining that a document B is a candidate for near duplicate to a document A with a given similarity level th. The system includes a storage for providing two different functions on the documents, each function having a numeric function value. The system further includes a processor associated with the storage and configured to determine that the document B is a candidate for near duplicate to the document A, if a condition is met. The condition includes: for any function $f_i$ from among the two functions, $|f_i(A) - f_i(B)| \le \delta_i(f, A, th)$.
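The condition in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the two functions (`word_count`, `mean_word_length`), the tolerance formulas standing in for $\delta_i$, and the representation of a document as a list of words. The sketch only shows the shape of the test: B is kept as a candidate when every function's values on A and B differ by no more than that function's tolerance.

```python
from typing import Callable, Sequence, Tuple

Features = Sequence[str]
ValueFn = Callable[[Features], float]
# Hypothetical tolerance signature: delta(A, th) for one fixed function f_i.
ToleranceFn = Callable[[Features, float], float]


def word_count(words: Features) -> float:
    """Hypothetical function f_1: number of words in the document."""
    return float(len(words))


def mean_word_length(words: Features) -> float:
    """Hypothetical function f_2: average word length (0.0 for an empty document)."""
    return sum(len(w) for w in words) / len(words) if words else 0.0


def is_candidate(
    a: Features,
    b: Features,
    th: float,
    functions: Sequence[Tuple[ValueFn, ToleranceFn]],
) -> bool:
    """Return True if |f_i(a) - f_i(b)| <= delta_i(a, th) holds for every f_i."""
    return all(abs(f(a) - f(b)) <= delta(a, th) for f, delta in functions)


# Illustrative tolerances only: they shrink as the similarity level th approaches 1.
FUNCTIONS = [
    (word_count, lambda a, th: (1.0 - th) * len(a)),
    (mean_word_length, lambda a, th: (1.0 - th) * 5.0),
]

doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
doc_c = "completely different text".split()

print(is_candidate(doc_a, doc_b, 0.9, FUNCTIONS))
print(is_candidate(doc_a, doc_c, 0.9, FUNCTIONS))
```

Passing this test only marks B as a candidate; a full comparison of the two documents would still be needed to confirm a near duplicate.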
51 Claims
1. A method for determining that at least one data object B is a candidate for near duplicate to a data object A with a given similarity level th, comprising:
(i) providing from a storage at least two different functions on a data object, each function having a numeric function value;
(ii) determining by a processor that at least one data object B is a candidate for near duplicate to a data object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions, $|f_i(A) - f_i(B)| \le \delta_i(f, th)$, wherein $\delta_i$ is dependent upon at least $f$, th.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
36. A method for determining that a document A is a candidate for near duplicate to at least one other document B, comprising:
i) providing from a storage at least two different bounded functions f on document, and for each classifier providing a vector with n buckets where n is a function of th, each of size 1/n;
ii) receiving the document A, associating a unique document id to the document, and calculating a list of features by a processor;
iii) calculating by the processor a rank = $f(A)$, where A being the list of features of the documents;
iv) calculating by the processor, add document id to buckets in the vector, as follows: Floor(n·rank) (if greater than zero, otherwise discard this option), Floor(n·rank)+1, and Floor(n·rank)+2 (if less than n, otherwise discard this option);
v) calculating union on documents id in the buckets, giving rise to set of documents id;
vi) applying by the processor (ii)-(v), in respect to a different classifier from among said at least two classifiers, giving rise to respective at least two sets of documents id;
vii) applying by the processor intersection to the at least two of the sets, stipulated in (vi), giving rise to at least two documents id, if any, being candidates for near duplicate.
View Dependent Claims (37, 38, 39, 40, 41, 42, 43)
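The bucket steps of claim 36 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation. The class `BucketVector`, the helper `add_document`, the two bounded functions and the choice n = 10 are all hypothetical; the claim only says that n is a function of th. The bucket rule is applied literally as worded: Floor(n·rank) only if greater than zero, Floor(n·rank)+1 always, and Floor(n·rank)+2 only if less than n.

```python
import math
from collections import defaultdict


class BucketVector:
    """One vector of buckets for a single bounded function f with values in [0, 1]."""

    def __init__(self, f, n):
        self.f = f
        self.n = n                        # assumed to be derived from th elsewhere
        self.buckets = defaultdict(set)   # bucket number -> set of document ids

    def bucket_numbers(self, rank):
        """Bucket numbers for a rank, following step (iv) as literally worded."""
        base = math.floor(self.n * rank)
        numbers = [base + 1]
        if base > 0:
            numbers.append(base)
        if base + 2 < self.n:
            numbers.append(base + 2)
        return numbers

    def add(self, doc_id, features):
        """Steps (iii)-(v): rank the document, file its id, union the touched buckets."""
        rank = self.f(features)
        ids = set()
        for number in self.bucket_numbers(rank):
            self.buckets[number].add(doc_id)
            ids |= self.buckets[number]
        return ids


def add_document(vectors, doc_id, features):
    """Steps (vi)-(vii): repeat per function, then intersect the resulting id sets."""
    per_function = [v.add(doc_id, features) for v in vectors]
    return set.intersection(*per_function)


# Two hypothetical bounded functions on a list of words.
def f_len(words):
    return min(len(words), 100) / 100


def f_avg(words):
    return min(sum(len(w) for w in words) / len(words), 20) / 20


vectors = [BucketVector(f_len, 10), BucketVector(f_avg, 10)]
first = add_document(vectors, 1, "the quick brown fox jumps over the lazy dog".split())
second = add_document(vectors, 2, "the quick brown fox jumped over the lazy dog".split())
third = add_document(vectors, 3, ["internationalization"] * 60)
print(first, second, third)
```

Because each document id is filed in its own bucket and the neighbouring ones, two documents whose ranks differ by less than one bucket width always share at least one bucket, so a lookup never has to scan the whole collection. The returned set contains the new document's own id, mirroring the claim's wording that the intersection yields at least two document ids when a candidate exists.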
44. A method for determining that at least one data object B is a candidate for near duplicate to a data object A, comprising:
(i) providing from a storage at least two different functions on a data object, each function having a numeric function value;
(ii) determining by a processor that at least one data object B is a candidate for near duplicate to a data object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions, $|f_i(A) - f_i(B)| \le \delta_i(f, A)$, wherein $\delta_i$ is dependent upon at least $f$ and A.
View Dependent Claims (45)
46. A method for determining that at least one data object B is a candidate for near duplicate to a data object A, comprising:
(i) providing from a storage at least two different functions on a data object, each function having a numeric function value;
(ii) determining by a processor that at least one data object B is a candidate for near duplicate to a data object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions a relationship between results of the function when applied to the data objects meets a given score.
View Dependent Claims (47, 48)
49. A system for determining that at least one object B is a candidate for near duplicate to an object A, comprising:
a storage providing at least two different functions on an object, each function having a numeric function value;
a processor associated with said storage and configured to determine that at least one object B is a candidate for near duplicate to an object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions, $|f_i(A) - f_i(B)| \le \delta_i(f, A)$, wherein $\delta_i$ is dependent upon at least $f$ and A.
View Dependent Claims (50)
51. A system for determining that at least one object B is a candidate for near duplicate to an object A, comprising:
a storage providing at least two different functions on an object, each function having a numeric function value;
a processor associated with said storage, configured to determine that at least one object B is a candidate for near duplicate to an object A, if a condition is met, the condition includes;
for any function $f_i$ from among said at least two functions a relationship between results of the function when applied to the objects meets a given score.
Specification