Systems and methods for efficient data searching, storage and reduction

US 8,275,782 B2
Filed: 03/19/2009
Issued: 09/25/2012
Est. Priority Date: 09/15/2004
Status: Expired due to Fees

Abstract

Systems and methods enabling search of a repository for the location of data that is similar to input data, using a defined measure of similarity, in a time that is independent of the size of the repository and linear in a size of the input data, and a space that is proportional to a small fraction of the size of the repository. The similar data segments thus located are further analyzed to determine their common (identical) data sections, regardless of the order and position of the common data sections in the repository and input, and in a time that is linear in the segment size and in constant space.
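One way to read the abstract's cost claims is as a hash-table lookup keyed on a small per-chunk signature. The sketch below is only an illustration under that assumption, not the patented implementation. The class `SimilarityIndex`, its methods, and the `min_matches` threshold are hypothetical names introduced here, and each chunk is assumed to be already summarized by a short list of integer hash values (the "distinguishing characteristics" described in the claims).

```python
from collections import Counter
from typing import Optional


class SimilarityIndex:
    """Hypothetical in-memory index: characteristic value -> repository chunk ids."""

    def __init__(self) -> None:
        self._by_dc: dict[int, list[int]] = {}

    def add(self, chunk_id: int, dcs: list[int]) -> None:
        # Only the few characteristic values per chunk are stored, so the index
        # stays a small fraction of the size of the repository data itself.
        for dc in dcs:
            self._by_dc.setdefault(dc, []).append(chunk_id)

    def most_similar(self, input_dcs: list[int], min_matches: int = 1) -> Optional[int]:
        # One dictionary lookup per input characteristic. Assuming few chunks
        # share any given value, the work depends on the number of input
        # characteristics, not on how many chunks the repository holds.
        votes: Counter = Counter()
        for dc in input_dcs:
            for chunk_id in self._by_dc.get(dc, ()):
                votes[chunk_id] += 1
        if not votes:
            return None
        chunk_id, count = votes.most_common(1)[0]
        return chunk_id if count >= min_matches else None


index = SimilarityIndex()
index.add(0, [11, 22, 33])
index.add(1, [44, 55, 66])
print(index.most_similar([22, 33, 99]))  # prints 0: two of three values match chunk 0
```

A vote count is used here as a stand-in for the "defined measure of similarity"; the source does not specify the measure at this point, so treat the threshold as an assumption.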

93 Citations

7 Claims

1. A system for providing input data to a repository to search repository data in the repository for data that are similar to the input data, the input data being divided into one or more input chunks, the system comprising:
- a data processor and a memory storing instructions for, for each input chunk, calculating a corresponding set of input distinguishing characteristics (IDCs), each set of IDCs comprising a plurality of distinguishing characteristics, said data processor being configured to partition the respective input chunk into a plurality of seeds, each seed being a smaller part of the respective input chunk and ordered in a seed sequence and to apply a hash function to each of the seeds to generate a plurality of hash values wherein each seed yields one hash value, characterized in that;
  
  said memory storing instructions configured to cause the data processor to select a subset (k) of the plurality of hash values;
  
  determine positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
  
  apply a function to the determined positions to determine corresponding other positions within the seed sequence; and
  
  define the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.

2. The system of claim 1, wherein:
- the subset (k) of hash values is selected by identifying the k largest hash values; and
  
  the function applied to determine the corresponding other positions is to identify a next seed in the seed(s) sequence.

3. The system of claim 1, wherein the data processor includes a central processing unit.

4. A method for providing input data to a repository to search repository data in the repository for data that is similar to the input data, the method comprising:
- dividing the input data into one or more input chunks;
  
  calculating a set of input distinguishing characteristics (IDCs) for each chunk, the set of input distinguishing characteristics comprising a plurality of characteristics and being obtained by;
  
  partitioning the respective input chunk into a plurality of seeds (s), each seed being a smaller part of the respective input chunk and ordered in a seed sequence;
  
  applying a hash function to each of the seeds to generate a plurality of hash values wherein each seed yields one hash value;
  
  selecting a subset (k) of the plurality of hash values;
  
  determining positions of the seeds within the seed sequence corresponding to the selected subset of hash values;
  
  applying a function to the determined positions to determine corresponding other positions within the seed sequence; and
  
  defining the set of distinguishing characteristics as the hash values of the seeds at the determined other positions.

5. The method of claim 4, wherein the method is performed at least in part by a data processor.

6. The method of claim 4, wherein:
- the subset (k) of hash values is selected by identifying the k largest hash values; and
  
  the function applied to determine the corresponding other positions is to identify a next seed in the seed sequence.

7. A non-transitory computer-readable medium encoded with computer executable instructions that cause a computer to perform a method of searching in repository data for data that is similar to an input data, wherein the repository data is divided into one or more repository chunks, the method comprising the steps of claim 4.
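
The steps recited in claims 4 and 6 can be sketched for a single chunk as follows. This is a minimal illustration under stated assumptions, not the patentee's implementation. The function names, the 4-byte seed size, the use of overlapping byte windows as seeds, the BLAKE2b stand-in hash, and the choice to skip the final seed (so that a "next seed" always exists) are all assumptions made for the example.

```python
import hashlib


def seed_hash(seed: bytes) -> int:
    # Stand-in hash function (assumption): 64-bit value from a BLAKE2b digest.
    return int.from_bytes(hashlib.blake2b(seed, digest_size=8).digest(), "big")


def distinguishing_characteristics(chunk: bytes, seed_size: int = 4, k: int = 3) -> list[int]:
    # Partition the chunk into an ordered seed sequence. Assumption: seeds are
    # every contiguous window of seed_size bytes, so consecutive seeds overlap.
    n = len(chunk) - seed_size + 1
    if n < 2:
        return []  # too short to have any seed with a successor

    # Apply the hash function to each seed; each seed yields one hash value.
    hashes = [seed_hash(chunk[i:i + seed_size]) for i in range(n)]

    # Select the k largest hash values and record their seed positions.
    # Assumption: the last seed is not a candidate because it has no next seed.
    positions = sorted(range(n - 1), key=lambda i: hashes[i], reverse=True)[:k]

    # Apply a function to those positions to get the "other" positions;
    # per claim 6, that function is "the next seed in the seed sequence".
    other_positions = [p + 1 for p in positions]

    # The distinguishing characteristics are the hash values at the other positions.
    return [hashes[p] for p in other_positions]


chunk = b"the quick brown fox jumps over the lazy dog"
print(distinguishing_characteristics(chunk, seed_size=4, k=3))
```

Note that the returned values are the hashes of the seeds that follow the k maxima, not the maximal hashes themselves, which is the distinction the "other positions" language in the claims draws.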

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Hirsch, Michael; Bitner, Haim; Aronovich, Lior; Asher, Ron; Bachmat, Eitan; Klein, Shmuel T.
Primary Examiner(s): Lewis, Alicia
Application Number: US12/407,786
Publication Number: US 20090228454A1
Time in Patent Office: 1,286 Days
Field of Search: None
US Class Current: 707/758
CPC Class Codes:
G06F 11/1448   Management of the data invo...
G06F 11/1453   using de-duplication of the...
G06F 16/137   Hash-based content-based in...
G06F 16/1744   using compression, e.g. spa...
G06F 16/2255   Hash tables
G06F 16/2455   Query execution
G06F 2201/80   Database-specific techniques
G06F 2201/805   Real-time
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99953   Recoverability
