Systems and methods for searching of storage data with reduced bandwidth requirements

  • US 8,725,705 B2
  • Filed: 07/29/2005
  • Issued: 05/13/2014
  • Est. Priority Date: 09/15/2004
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method, comprising a similarity search followed by an identity comparison:

  • the similarity search comprising:

    at a first location, using a first computer to determine a set of first data distinguishing characteristics associated with each of a plurality of first data chunks of first data stored at the first location, wherein determining the set of first data distinguishing characteristics associated with each first data chunk includes:

    calculating a mathematical hash value of each portion of respective data;

    determining a subset of k hash values from the calculated mathematical hash values, k being a predetermined number that is smaller than a total number of the calculated mathematical hash values calculated for each portion of the respective data;

    identifying a respective data portion for each of the k hash values;

    identifying a data portion shifted by a predetermined amount relative to each respective data portion corresponding to the k hash values;

    determining a mathematical hash value for each shifted data portion from the calculated mathematical hash values; and

    setting the set of first data distinguishing characteristics to be the mathematical hash values for each of the shifted data portions to obtain a more uniform probabilistic distribution than would be obtained using the data portions corresponding to the k hash values;

    transmitting the determined sets of first data distinguishing characteristics from the first location to a remote location different than the first location;

    at the remote location, using a remote computer to compare a plurality of the determined sets of first data distinguishing characteristics to one or more sets of remote data distinguishing characteristics, and to identify one or more remote data chunks of remote data stored at the remote location that are similar to the first data based on the comparison, wherein the one or more remote data chunks is determined to be similar to the first data when a number of matching distinguishing characteristics is found in the respective sets of distinguishing characteristics for the first and remote chunks which exceeds a similarity threshold; and

    the identity comparison comprising:

    using the determined similar data chunks, determining one or more differences between the first data and the identified similar remote data, without transmitting all of the first data to the remote location and without transmitting all of the identified similar remote data to the first location.
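
The characteristic-selection and threshold-comparison steps recited above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names, the 16-byte portion size, k = 4, the shift of one byte, and the use of BLAKE2b in place of a rolling hash are all hypothetical choices. The claim only requires a subset of k hash values; picking the k largest is likewise an assumption made here. The identity comparison (computing the actual differences) is not shown.

```python
import hashlib


def portion_hashes(chunk: bytes, portion_size: int) -> list[int]:
    """Hash every portion_size-byte window of the chunk, one per byte offset.

    Assumption: BLAKE2b stands in for whatever hash a real system would use
    (a rolling hash would avoid rehashing each window from scratch).
    """
    return [
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + portion_size], digest_size=8).digest(),
            "big",
        )
        for i in range(len(chunk) - portion_size + 1)
    ]


def distinguishing_characteristics(
    chunk: bytes, k: int = 4, portion_size: int = 16, shift: int = 1
) -> set[int]:
    """Return the hashes of the portions shifted from the k selected portions."""
    hashes = portion_hashes(chunk, portion_size)
    # Only positions whose shifted portion still lies inside the chunk qualify.
    candidates = range(len(hashes) - shift)
    # Assumption: the subset of k hash values is the k largest.
    selected = sorted(candidates, key=lambda i: hashes[i], reverse=True)[:k]
    # The characteristic is the hash of the shifted portion, not the selected one.
    return {hashes[i + shift] for i in selected}


def is_similar(first: set[int], remote: set[int], threshold: int) -> bool:
    """Chunks are similar when the number of matching characteristics exceeds the threshold."""
    return len(first & remote) > threshold


# Pseudo-random 2048-byte chunk, and a copy with its first byte altered.
first_chunk = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
remote_chunk = bytes([first_chunk[0] ^ 0xFF]) + first_chunk[1:]

first_dc = distinguishing_characteristics(first_chunk)
remote_dc = distinguishing_characteristics(remote_chunk)
print(is_similar(first_dc, remote_dc, threshold=2))
```

Only the small sets of characteristics would need to cross the network in this sketch, which is the bandwidth saving the claim is directed at. A local edit to a chunk changes only the hashes of the portions that overlap the edit, so most of the selected characteristics survive and the chunks are still reported as similar.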
