Systems and Methods for Efficient Data Searching, Storage and Reduction

US 20090228453A1
Filed: 03/19/2009
Published: 09/10/2009
Est. Priority Date: 09/15/2004
Status: Active Grant

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.

152 Claims

1-115. (canceled)

116. A method of searching in repository data for data that are similar to an input data, wherein the repository data comprises a plurality of repository data chunks and the input data comprises a plurality of input data chunks, the method comprising:
- for each repository data chunk, generating a corresponding set of repository distinguishing characteristics (RDCs);
  
  for each input data chunk, generating a corresponding set of input distinguishing characteristics (IDCs); and
  
  searching for data in the repository data that is similar to the input data by comparing the IDCs and RDCs, wherein each set of RDCs and IDCs is generated by:
  
  applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk;
  
  applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk;
  
  applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes; and
  
  defining the second subset of hashes as the set of respective IDCs or RDCs.
- View Dependent Claims (117, 118, 119, 120, 121, 122, 123, 124, 125)
  - 117. The method of claim 116, wherein applying the second function comprises:
    - determining other hash positions as a function of the hash positions of the hashes of the first subset; and
      
      defining the hashes at the other hash positions as the second subset of hashes.
  - 118. The method of claim 116, wherein the first function comprises one or more of:
    - selecting a number of the largest hash values;
      
      selecting a number of the smallest hash values;
      
      selecting a number of the hash values closest to a median value of the generated hash values for the corresponding data chunk;
      
      selecting a number of the hash values closest to a constant value; and
      
      selecting a number of the hash values closest to a percentile value of the generated hash values for the corresponding data chunk.
  - 119. The method of claim 118, wherein the second function comprises:
    - applying a constant value to each hash position corresponding to each of the hashes of the first subset.
  - 120. The method of claim 119, wherein:
    - an absolute value of the constant value is 1.
  - 121. The method of claim 116, wherein:
    - in the step of comparing IDCs and RDCs, the number of RDCs in a set is less than the number of IDCs in a set.
  - 122. The method of claim 116, wherein:
    - the searching is conducted by comparing less than all of the IDCs to the RDCs.
  - 123. The method of claim 116, further comprising:
    - maintaining a searchable index of RDCs for the comparing step.
  - 124. The method of claim 116, further comprising:
    - determining that a similarity exists if a similarity threshold is met.
  - 125. The method of claim 116, wherein the comparing is conducted in a time independent of a size of the repository data and linear in a size of the input data.
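
The generation steps recited in claims 116 to 120 can be illustrated with a minimal Python sketch. It is not the patented implementation: the seed (window) length `SEED_SIZE`, the use of BLAKE2b as the hash function, the parameter names `k` and `shift`, and the clamping of shifted positions at the chunk boundary are all assumptions made for illustration. The first function here is the "largest hash values" option of claim 118, and the second function adds a constant to each hash position as in claims 119 and 120.

```python
import hashlib
from typing import List, Tuple

SEED_SIZE = 8  # hypothetical window length in bytes; not specified in the claims


def seed_hashes(chunk: bytes, seed_size: int = SEED_SIZE) -> List[Tuple[int, int]]:
    """Return one (hash_value, hash_position) pair per seed_size-byte window."""
    hashes = []
    for pos in range(len(chunk) - seed_size + 1):
        digest = hashlib.blake2b(chunk[pos:pos + seed_size], digest_size=8).digest()
        hashes.append((int.from_bytes(digest, "big"), pos))
    return hashes


def distinguishing_characteristics(
    chunk: bytes, k: int = 4, shift: int = 1, seed_size: int = SEED_SIZE
) -> List[Tuple[int, int]]:
    """Pick the k largest hashes, then return the hashes at position + shift."""
    hashes = seed_hashes(chunk, seed_size)
    if not hashes:
        return []
    # First function (claim 118 option): the k largest hash values.
    first_subset = sorted(hashes, reverse=True)[:k]
    # Second function (claims 117, 119, 120): apply a constant to each position.
    # Clamping to the valid range is an assumption for boundary cases.
    last = len(hashes) - 1
    shifted = {min(max(pos + shift, 0), last) for _, pos in first_subset}
    # The hashes at the shifted positions form the set of IDCs or RDCs.
    return [hashes[pos] for pos in sorted(shifted)]


chunk = bytes(range(256)) * 4
print(distinguishing_characteristics(chunk, k=4, shift=1))
```

The same function would be applied to input chunks (yielding IDCs) and to repository chunks (yielding RDCs), so that similar chunks tend to share characteristics. This sketch hashes every window from scratch for clarity; a rolling hash would be the usual choice in practice.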

126-1. A. The method of claim 126, wherein the comparing is conducted in a time independent of a size of the repository data and linear in a size of the input data.

128. A computer-readable medium encoded with computer-executable instructions that cause a computer to perform a method of searching in repository data for data that are similar to an input data, wherein the repository data is divided into one or more repository data chunks and the input data comprises a plurality of input data chunks, the medium comprising:
- program code for generating a corresponding set of repository distinguishing characteristics (RDCs) for each repository data chunk;
  
  program code for generating a set of input distinguishing characteristics (IDCs) for each input data chunk;
  
  and program code for searching for data in the repository data that is similar to the input data by comparing the IDCs and RDCs, wherein the program code for generating each set of IDCs and the program code for generating each set of RDCs comprises:
  
  program code for applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk;
  
  program code for applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk;
  
  program code for applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes;
  
  and program code for defining the second subset of hashes as the set of respective IDCs or RDCs.
- View Dependent Claims (129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139)
  - 129. The computer-readable medium of claim 128, wherein the program code for applying the second function comprises:
    - program code for determining other hash positions as a function of the hash positions of the hashes of the first subset; and
      
      program code for defining the hashes at the other hash positions as the second subset of hashes.
  - 130. The computer-readable medium of claim 128, wherein the program code for applying the first function comprises one or more of:
    - program code for selecting a number of the largest hash values;
      
      program code for selecting a number of the smallest hash values;
      
      program code for selecting a number of the hash values closest to a median value of the generated hash values for the corresponding data chunk;
      
      program code for selecting a number of the hash values closest to a constant value; and
      
      program code for selecting a number of the hash values closest to a percentile value of the generated hash values for the corresponding data chunk.
  - 131. The computer-readable medium of claim 130, wherein the program code for applying the second function comprises:
    - program code for applying a constant value to each hash position corresponding to each of the hashes of the first subset.
  - 132. The computer-readable medium of claim 131, wherein:
    - an absolute value of the constant value is 1.
  - 133. The computer-readable medium of claim 128, wherein:
    - in the program code for searching, the number of RDCs in a set is less than the number of IDCs in a set.
  - 134. The computer-readable medium of claim 128, wherein:
    - the program code for searching for data in the repository data comprises program code for comparing less than all of the IDCs to the RDCs.
  - 135. The computer-readable medium of claim 128, further comprising:
    - program code for maintaining a searchable index of RDCs.
  - 136. The computer-readable medium of claim 128, further comprising:
    - program code for determining a similarity exists if a similarity threshold is met.
  - 137. The computer-readable medium of claim 128, wherein the comparing is conducted in a time independent of a size of the repository data and linear in a size of the input data.
  - 138. The computer-readable medium of claim 128, further comprising:
    - program code for determining at least one of common and noncommon sections of the input data chunk and the repository data chunk determined to be similar using the matching distinguishing characteristics to define corresponding intervals in the input data chunk and similar repository data chunk.
  - 139. The computer-readable medium of claim 138, further comprising:
    - program code for storing the noncommon sections of the input data in the repository.
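
The searchable index of RDCs (claim 135) and the similarity threshold (claim 136) could be sketched as follows. This is a hedged illustration only: the class name `RDCIndex`, its methods, and the vote-counting rule are hypothetical and are not taken from the specification. Each input characteristic costs one hash-table lookup, which is one way a search could depend on the number of IDCs rather than on the repository size.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Optional, Set


class RDCIndex:
    """Hypothetical in-memory index mapping RDC hash values to chunk ids."""

    def __init__(self) -> None:
        self._chunks_by_dc: Dict[int, Set[Hashable]] = defaultdict(set)

    def add_chunk(self, chunk_id: Hashable, rdcs: Iterable[int]) -> None:
        """Register the distinguishing characteristics of one repository chunk."""
        for value in rdcs:
            self._chunks_by_dc[value].add(chunk_id)

    def most_similar(self, idcs: Iterable[int], threshold: int) -> Optional[Hashable]:
        """Return the chunk sharing the most characteristics with the input,
        or None if no chunk reaches the similarity threshold."""
        votes: Counter = Counter()
        for value in set(idcs):
            # .get avoids creating empty entries for unknown values.
            for chunk_id in self._chunks_by_dc.get(value, ()):
                votes[chunk_id] += 1
        if not votes:
            return None
        chunk_id, count = votes.most_common(1)[0]
        return chunk_id if count >= threshold else None


index = RDCIndex()
index.add_chunk("chunk-a", [1, 2, 3, 4])
index.add_chunk("chunk-b", [3, 4, 5, 6])
print(index.most_similar([1, 2, 3, 9], threshold=2))
```

Here the threshold is a minimum count of matching characteristics, which is an assumption; the claims only require that some similarity threshold be met.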

140. A system for searching in repository data for data that are similar to an input data, wherein the repository data is divided into one or more repository data chunks and the input data comprises a plurality of input data chunks, the system comprising:
- means for, for each repository data chunk, generating a corresponding set of repository distinguishing characteristics (RDCs);
  
  means for generating a corresponding set of input distinguishing characteristics (IDCs) for each input data chunk; and
  
  means for searching for data in the repository data that is similar to the input data by comparing the IDCs and RDCs, wherein the means for generating the IDCs and the means for generating the RDCs each comprises:
  
  means for applying a hash function to the respective input data chunk or repository data chunk to generate a plurality of hashes, each hash comprising a hash value and a hash position within the data chunk;
  
  means for applying a first function to the plurality of generated hashes to identify a first subset of hashes distributed across the data chunk;
  
  means for applying a second function to the hash positions of the hashes of the first subset to identify a second subset of the plurality of generated hashes; and
  
  means for defining the second subset of hashes as the set of respective IDCs or RDCs.
- View Dependent Claims (141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151)
  - 141. The system of claim 140, wherein the means for applying the second function comprises:
    - means for determining other hash positions as a function of the hash positions of the hashes of the first subset; and
      
      means for defining the hashes at the other hash positions as the second subset of hashes.
  - 142. The system of claim 140, wherein the first function comprises one or more of:
    - selecting a number of the largest hash values;
      
      selecting a number of the smallest hash values;
      
      selecting a number of the hash values closest to a median value of the generated hash values for the corresponding data chunk;
      
      selecting a number of the hash values closest to a constant value; and
      
      selecting a number of the hash values closest to a percentile value of the generated hash values for the corresponding data chunk.
  - 143. The system of claim 142, wherein the second function comprises:
    - applying a constant value to each hash position corresponding to each of the hashes of the first subset.
  - 144. The system of claim 143, wherein:
    - an absolute value of the constant value is 1.
  - 145. The system of claim 140, wherein:
    - in the means for searching, the number of RDCs in a set is less than the number of IDCs in a set.
  - 146. The system of claim 140, wherein:
    - the means for searching for data in the repository data comprises means for comparing less than all of the IDCs to the RDCs.
  - 147. The system of claim 140, further comprising:
    - means for maintaining an index of RDCs and the corresponding repository data chunk.
  - 148. The system of claim 140, further comprising:
    - means for determining that a similarity exists if a similarity threshold is met.
  - 149. The system of claim 140, wherein the comparing is conducted in a time independent of a size of the repository data and linear in a size of the input data.
  - 150. The system of claim 140, further comprising:
    - means for determining at least one of common and noncommon sections of the input data chunk and the repository data chunk determined to be similar using the matching distinguishing characteristics to define corresponding intervals in the input data chunk and similar repository data chunk.
  - 151. The system of claim 150, further comprising:
    - means for storing the noncommon sections in the repository.
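
Claims 150 and 151 (and 138 and 139) use matching distinguishing characteristics to define corresponding intervals and then store the noncommon sections. One possible reading is sketched below in Python, under stated assumptions: each matching characteristic is given as an `(input_pos, repo_pos)` anchor pair, and each anchor is grown byte by byte in both directions. The function names and the anchor-extension strategy are illustrative and are not claimed to be the method of the specification.

```python
from typing import List, Tuple


def common_intervals(
    input_chunk: bytes,
    repo_chunk: bytes,
    anchors: List[Tuple[int, int]],
    seed_size: int,
) -> List[Tuple[int, int, int]]:
    """Grow each anchor into a common interval (input_start, repo_start, length)."""
    intervals = []
    covered_until = 0  # input bytes before this offset are already explained
    for i, j in sorted(anchors):
        if i < covered_until:
            continue  # anchor lies inside an interval found earlier
        if i + seed_size > len(input_chunk) or j + seed_size > len(repo_chunk):
            continue  # anchor does not fit inside the chunks
        if input_chunk[i:i + seed_size] != repo_chunk[j:j + seed_size]:
            continue  # equal hash values but different bytes (collision)
        start_i, start_j = i, j
        while (start_i > covered_until and start_j > 0
               and input_chunk[start_i - 1] == repo_chunk[start_j - 1]):
            start_i -= 1
            start_j -= 1
        end_i, end_j = i + seed_size, j + seed_size
        while (end_i < len(input_chunk) and end_j < len(repo_chunk)
               and input_chunk[end_i] == repo_chunk[end_j]):
            end_i += 1
            end_j += 1
        intervals.append((start_i, start_j, end_i - start_i))
        covered_until = end_i
    return intervals


def noncommon_sections(
    input_chunk: bytes, intervals: List[Tuple[int, int, int]]
) -> List[bytes]:
    """Return the input bytes not covered by any common interval."""
    gaps, cursor = [], 0
    for start, _, length in intervals:
        if start > cursor:
            gaps.append(input_chunk[cursor:start])
        cursor = start + length
    if cursor < len(input_chunk):
        gaps.append(input_chunk[cursor:])
    return gaps


repo = b"HEADERthe quick brown foxTAIL"
new = b"xxthe quick brown foxyy"
found = common_intervals(new, repo, anchors=[(6, 10)], seed_size=5)
print(found, noncommon_sections(new, found))
```

Only the noncommon sections would then need to be stored, with the common intervals recorded as references into the repository chunk.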

152-186. (canceled)

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Bitner, Haim; Asher, Ron; Aronovich, Lior; Hirsch, Michael; Klein, Shmuel T.; Bachmat, Eitan

Granted Patent

US 9,430,486 B2
US Class Current

1/1
CPC Class Codes

G06F 11/1448   Management of the data invo...

G06F 11/1453   using de-duplication of the...

G06F 16/137   Hash-based content-based in...

G06F 16/1744   using compression, e.g. spa...

G06F 16/2255   Hash tables

G06F 16/2455   Query execution

G06F 2201/80   Database-specific techniques

G06F 2201/805   Real-time

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99953   Recoverability
