Data hashing method, data processing method, and data processing system using similarity-based hashing algorithm
First Claim
1. A data hashing method using a similarity-based hashing (SBH) algorithm, the data hashing method comprising:
receiving computerized data; and
generating a hash value of the computerized data using the SBH algorithm in which two data are the same if calculated hash values are the same and two data are similar if the difference of calculated hash values is small, wherein the hash value has at least two variable values that allows for a quick search of the computerized data for determining if the two data are similar, wherein the generating of the hash value of the computerized data using the SBH algorithm comprises:
calculating a fingerprint value from the content of the computerized data;
changing a component value of an Nth-order hash vector to correspond to the fingerprint value according to a predetermined rule;
determining whether the entire amount of the content of the computerized data has been processed; and
if it is determined that the entire amount of the content of the computerized data has been processed, converting the Nth-order hash vector to the hash value, and wherein the calculating of the fingerprint value comprises:
extracting a shingle, which is a continuous or discontinuous byte-string having a predetermined length, from the computerized data; and
generating a fingerprint value using a data hashing algorithm which satisfies uniformity and randomness criteria for the shingle and has a low possibility of collision.
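The steps recited in the claim (shingle extraction, fingerprinting, updating an Nth-order hash vector, and converting the vector to a hash value) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the claim does not fix the "predetermined rule", the conversion step, or the fingerprint function, so the vector order `N`, the shingle length `SHINGLE_LEN`, the use of BLAKE2b, the modulo-based update rule, the 0..255 scaling, and the `hash_difference` measure are all hypothetical choices made for this example.

```python
import hashlib

N = 8            # order of the hash vector (assumed; the claim only says "Nth-order")
SHINGLE_LEN = 4  # predetermined shingle length in bytes (assumed)


def fingerprint(shingle: bytes) -> int:
    """Fingerprint one shingle. BLAKE2b stands in for the unspecified hash."""
    digest = hashlib.blake2b(shingle, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sbh_hash(data: bytes, n: int = N, k: int = SHINGLE_LEN) -> tuple:
    """Return a hash value made of n small integers (the "variable values")."""
    vector = [0] * n
    # The loop ends once every contiguous shingle of the content has been processed.
    for start in range(max(len(data) - k + 1, 0)):
        fp = fingerprint(data[start:start + k])
        vector[fp % n] += 1  # assumed rule: the fingerprint selects one component to increment
    total = sum(vector)
    if total == 0:  # input shorter than one shingle
        return tuple(vector)
    # Assumed conversion: scale each component to 0..255 so that hash values
    # of inputs with different lengths remain comparable.
    return tuple(c * 255 // total for c in vector)


def hash_difference(h1: tuple, h2: tuple) -> int:
    """Sum of absolute component differences (an assumed difference measure)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


base = b"the quick brown fox jumps over the lazy dog " * 20
near = base[:100] + b"X" + base[101:]  # same content with one byte changed
print(sbh_hash(base))
print(hash_difference(sbh_hash(base), sbh_hash(near)))
```

Under these assumptions, identical inputs always yield identical hash values, and a single-byte edit touches at most `SHINGLE_LEN` shingles, so the hash value can only move by a small amount.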
Abstract
A data hashing method, a data processing method, and a data processing system using a similarity-based hashing (SBH) algorithm in which the same hash value is calculated for the same data and, the more similar the data, the smaller the difference between the generated hash values. The data hashing method includes receiving computerized data and generating a hash value of the computerized data using the SBH algorithm, in which two data are the same if their calculated hash values are the same and similar if the difference between their calculated hash values is small. Search, comparison, and classification of data may be processed quickly, within a time complexity of O(1) or O(n), since the similarity of data content is quantified by the component values of the corresponding generated hash values.
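To illustrate how the O(1) and O(n) figures in the abstract might arise, the sketch below stores hash values in a dictionary: an exact-match query is a single lookup, while a similarity query scans the stored values and compares component-wise differences. The `HashIndex` class, its method names, the two-component hash values, and the sum-of-absolute-differences measure are all hypothetical; the abstract does not describe a particular index structure.

```python
from collections import defaultdict


class HashIndex:
    """Toy index over SBH hash values (tuples of ints), for illustration only."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def add(self, name: str, hash_value: tuple) -> None:
        self._by_hash[hash_value].append(name)

    def find_same(self, hash_value: tuple) -> list:
        """Exact match: one dictionary lookup, O(1) on average."""
        return list(self._by_hash.get(hash_value, []))

    def find_similar(self, hash_value: tuple, max_diff: int) -> list:
        """Near match: linear scan over the stored hash values, O(n)."""
        hits = []
        for stored, names in self._by_hash.items():
            diff = sum(abs(a - b) for a, b in zip(stored, hash_value))
            if diff <= max_diff:
                hits.extend(names)
        return hits


index = HashIndex()
index.add("report_v1", (10, 20))  # made-up hash values
index.add("report_v2", (10, 21))
index.add("photo", (90, 5))
print(index.find_same((10, 20)))
print(index.find_similar((10, 20), max_diff=2))
```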
23 Claims
1. Recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17.
18. A data processing system using a computer-readable medium in association with a computing device which includes a processor and a memory, the computer readable medium including computer instructions which are configured to cause the computing device to perform a similarity-based hashing (SBH) algorithm, the data processing system comprising:
an inputting unit to which computerized data is input;
a hash value generator generating a hash value of the input computerized data using the SBH algorithm; and
a data processing unit processing the computerized data using hash values, wherein the SBH algorithm further comprises:
calculating a plurality of fingerprint values from the content of the computerized data;
creating an Nth-order hash vector corresponding to the fingerprint values according to a predetermined rule; and
converting the Nth-order hash vector into the hash value, and wherein the computerized data are packets transmitted through a network, hash values corresponding to the packets are listed in a hash value table, and the data processing unit monitors or blocks a rapid increase of packets that are the same as or similar to a specific packet by checking whether the number of specific hash values or hash values that have a difference compared to the specific hash value within a predetermined range is greater than a threshold number.
Dependent claims: 19, 20, 21, 22, 23.
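The packet-monitoring check in claim 18 can be sketched as a count over a hash value table. This is a minimal illustration under assumptions: the table is modelled as a plain list of per-packet hash values, and the function names, the difference measure (sum of absolute component differences), and the sample values are hypothetical rather than taken from the patent.

```python
def hash_difference(h1: tuple, h2: tuple) -> int:
    """Assumed difference measure: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def count_similar(hash_table: list, specific: tuple, max_diff: int) -> int:
    """Count table entries equal to, or within max_diff of, the specific hash value."""
    return sum(1 for h in hash_table if hash_difference(h, specific) <= max_diff)


def should_block(hash_table: list, specific: tuple, max_diff: int, threshold: int) -> bool:
    """Flag a surge when the count of same-or-similar packets exceeds the threshold."""
    return count_similar(hash_table, specific, max_diff) > threshold


# Made-up hash values for five observed packets.
hash_table = [(10, 20), (10, 21), (11, 20), (90, 5), (10, 20)]
print(count_similar(hash_table, (10, 20), max_diff=1))
print(should_block(hash_table, (10, 20), max_diff=1, threshold=3))
```

A real monitor would presumably evaluate this over a sliding time window so that "rapid increase" has a defined rate; the sketch leaves windowing out because the claim does not specify it.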
Specification