Use of hash values for identification and location of content

US 8,171,004 B1
Filed: 04/05/2007
Issued: 05/01/2012
Est. Priority Date: 04/20/2006
Status: Expired due to Fees

Abstract

Surrogate hashing is described, including initializing one or more variables in a collection, evaluating an address associated with a host, comparing the address to the collection to determine if the address is stored in the collection, and processing the address to hash a file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection.
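
The address-collection check described in the abstract can be illustrated with a short Python sketch. This is only a loose illustration under stated assumptions, not the patented implementation: `crawl`, `fetch`, and `links` are hypothetical names, SHA-256 is an arbitrary choice of hash, and an address already in the collection is simply skipped in favor of the next pending address.

```python
import hashlib
from collections import deque


def crawl(start, fetch, links):
    """Hash every file reachable from `start`, visiting each address once.

    `fetch(address)` returns the file bytes and `links(address)` returns the
    addresses indicated by that address. Both are hypothetical callables
    supplied by the caller; nothing here touches the network.
    """
    collection = set()        # addresses that have already been processed
    hashes = {}               # address -> hex digest of the downloaded file
    pending = deque([start])
    while pending:
        address = pending.popleft()
        if address in collection:
            # Already stored: move on to the next address that was indicated.
            continue
        collection.add(address)
        hashes[address] = hashlib.sha256(fetch(address)).hexdigest()
        pending.extend(links(address))
    return hashes


# Usage with an in-memory stand-in for a host (hypothetical data).
pages = {
    "a": (b"AAAA", ["b", "c"]),
    "b": (b"BBBB", ["a"]),      # links back to "a", which is skipped
    "c": (b"AAAA", []),         # same content as "a"
}
result = crawl("a", lambda addr: pages[addr][0], lambda addr: pages[addr][1])
```

Two addresses that serve identical bytes, such as "a" and "c" above, end up with the same digest, which is what lets a hash value act as an identifier for the content rather than for the address.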

39 Claims

1. A method, comprising:
- retrieving a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
  
  detecting the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
  
  identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
  
  running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
  
  determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
  
  in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value; and
  
  storing the second hash value and the address associated with the first file.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19):
  - 2. The method of claim 1, further comprising retrieving the address if the address points to another hostname.
  - 3. The method of claim 1, further comprising:
    - retrieving the address if the address points to another hostname; and
      
      processing the address to download another file from the another hostname.
  - 4. The method of claim 1, further comprising:
    - retrieving the address if the address points to another hostname;
      
      processing the address to download another file from the another hostname;
      
      running a third hashing algorithm against another portion of the another file to generate a third hash value for the another file, and running a fourth hashing algorithm against the another portion of the another file to generate a fourth hash value for the another file;
      
      generating a fifth hash value for the another file by concatenating the third hash value of the another file with the fourth hash value of the another file; and
      
      storing the fifth hash value and the address associated with the another file.
  - 5. The method of claim 1, further comprising registering a crawler with a storage facility.
  - 6. The method of claim 1, further comprising evaluating the hostname to determine if another file should be downloaded.
  - 7. The method of claim 1, further comprising evaluating the hostname to identify one or more other files, the one or more other files being downloaded.
  - 8. The method of claim 7, wherein the one or more other files being downloaded are used to generate a plurality of hash values by running a third hashing algorithm against a first portion of each of the one or more other files to generate a third hash value for each of the one or more other files, running a fourth hashing algorithm against a second portion of each of the one or more other files to generate a fourth hash value for each of the one or more other files, and generating a fifth hash value for each of the one or more other files by concatenating the respective third hash value and the respective fourth hash value.
  - 9. The method of claim 8, wherein the fifth hash value for each of the one or more other files are stored in a storage facility.
  - 10. The method of claim 1, wherein storing the address further comprises storing the address with the second hash value.
  - 11. The method of claim 1, further comprising comparing the address to a collection of one or more other addresses.
  - 16. The method of claim 1, further comprising determining whether the second hash value matches a stored hash value.
  - 17. The method of claim 16, further comprising determining a location of another file that is associated with the stored hash value when the second hash value matches the stored hash value.
  - 18. The method of claim 16, wherein determining whether the second hash value matches the stored hash value further comprises determining whether a portion of the second hash value matches the stored hash value, and the portion corresponds to the first hash value.
  - 19. The method of claim 16, wherein the second hash value matches the stored hash value when the second hash value and the stored hash value are substantially similar.
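
The hashing steps of claim 1 can be read as a two-stage routine: hash a consistently selected portion of the file, and run a second, different algorithm only when the first hash value collides with one already seen. The following is a minimal Python sketch under stated assumptions, not the patented implementation. The names `standardized_portion` and `HashIndex`, the 4096-byte window at offset 0, and the choice of SHA-1 and SHA-256 are all hypothetical, and only exact equality is checked (the claim's "substantially similar" test is not modeled).

```python
import hashlib


def standardized_portion(data: bytes, offset: int = 0, size: int = 4096) -> bytes:
    """Select the same slice from every file, given a location and a size.

    The default offset and size are illustrative assumptions.
    """
    return data[offset:offset + size]


class HashIndex:
    """Hypothetical in-memory store for hash values and addresses."""

    def __init__(self):
        self.first_seen = {}   # first hash value -> address of first file seen
        self.records = {}      # second hash value -> address

    def add(self, address: str, data: bytes):
        portion = standardized_portion(data)
        first = hashlib.sha1(portion).hexdigest()      # first hashing algorithm
        if first not in self.first_seen:
            # No matching hash value from another file: nothing more to do.
            self.first_seen[first] = address
            return first, None
        # The first hash matches that of another file, so run a second,
        # different algorithm against the same standardized portion.
        second = hashlib.sha256(portion).hexdigest()
        self.records[second] = address                 # store hash and address
        return first, second


# Usage: two addresses serving the same bytes (hypothetical URLs).
index = HashIndex()
f1, s1 = index.add("http://host-a.example/song.mp3", b"same bytes")
f2, s2 = index.add("http://host-b.example/copy.mp3", b"same bytes")
```

Deferring the second algorithm until a collision occurs means the cheaper first hash handles the common case, and the second value is computed only for files that need further disambiguation.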

12. A method for file identification, comprising:
- evaluating an address associated with a hostname;
  
  comparing the address to the collection to determine if the address is stored in the collection;
  
  processing the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection;
  
  identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
  
  running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
  
  determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file; and
  
  in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value.
- Dependent claims (13, 14):
  - 13. The method of claim 12, wherein processing the address further comprises storing the address if the address is associated with another hostname.
  - 14. The method of claim 12, further comprising storing the address, the first hash value, and the second hash value.
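
Claim 13 (like the detecting step of claim 1) separates addresses on the hostname being evaluated from addresses that belong to another hostname, which are stored rather than downloaded. A minimal Python sketch of that split follows. The function name `sort_addresses` and the sample URLs are hypothetical, and the sketch assumes every address is an absolute URL.

```python
from urllib.parse import urlsplit


def sort_addresses(hostname, addresses):
    """Split addresses into (download now, store for later).

    An address is downloaded when its host equals `hostname`; any other
    address is stored so it can be processed with its own hostname later.
    Assumes absolute URLs; an address with no host lands in the stored list.
    """
    to_download, stored = [], []
    for address in addresses:
        host = urlsplit(address).hostname   # lowercased by urlsplit, or None
        if host is not None and host == hostname.lower():
            to_download.append(address)
        else:
            stored.append(address)
    return to_download, stored


# Usage with hypothetical addresses.
same_host, other_host = sort_addresses(
    "media.example.com",
    [
        "http://media.example.com/a.mp3",
        "http://elsewhere.example.org/b.mp3",
        "http://MEDIA.example.com/c.mp3",
    ],
)
```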

15. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for:
- retrieving a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
  
  detecting the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
  
  identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
  
  running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
  
  determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
  
  in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value; and
  
  storing the address associated with the first file, the first hash value, and the second hash value.

20. A system, comprising:
- at least one processor and memory configured to:
  
  evaluate an address associated with a hostname;
  
  compare the address to the collection to determine if the address is stored in the collection;
  
  process the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection;
  
  identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
  
  run a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
  
  determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file; and
  
  in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value.

21. A system, comprising:
- a memory configured to be in data communication with a processor; and
  
  a processor configured:
  
  to retrieve a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
  
  to detect the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
  
  to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
  
  to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
  
  to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
  
  to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file; and
  
  to store the second hash value and the address associated with the first file.
- Dependent claims (22, 23, 24, 25, 26, 27, 28, 29, 30):
  - 22. The system of claim 21, further comprising a crawler instance registered with the memory.
  - 23. The system of claim 21, wherein the memory is a database.
  - 24. The system of claim 21, wherein the processor is further configured to process the address if the address points to another hostname and to download a file from the another hostname.
  - 25. The system of claim 21, wherein the processor is further configured to process the address if the address points to another hostname, to download another file from the another hostname, to identify another standardized portion of data contents associated with the another file by identifying a data set to be selected consistently from the another file, wherein the data set is identified based on a size and a location associated with the another file, to run a third hashing algorithm against the another standardized portion of data contents to generate a third hash value for the another file, and to run a fourth hashing algorithm against the another standardized portion of data contents to generate a fourth hash value, to generate a fifth hash value for the another file by concatenating the third hash value of the another file and the fourth hash value of the another file, and to store the fifth hash value and the address associated with the another file.
  - 26. The system of claim 21, wherein the processor is further configured to evaluate the hostname to determine if another file should be downloaded.
  - 27. The system of claim 21, wherein the memory is further configured to store the address with the second hash value.
  - 28. The system of claim 27, wherein the processor is further configured to compare the address to a collection of one or more other addresses.
  - 29. The system of claim 21, wherein the processor is further configured to determine whether the second hash value matches a stored hash value.
  - 30. The system of claim 29, wherein the processor is further configured to identify an address of another file that corresponds to the stored hash value when the second hash value matches the stored hash value.
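
Claim 25 (and likewise claims 4 and 8) describes a composite value built by concatenating the outputs of two hashing algorithms. A minimal Python sketch of that concatenation is shown here. The function name `composite_hash`, the 4096-byte window, and the pairing of SHA-1 with SHA-256 are illustrative assumptions; the claims do not name specific algorithms.

```python
import hashlib


def composite_hash(data: bytes, offset: int = 0, size: int = 4096) -> str:
    """Concatenate two digests of the same standardized portion.

    The window (offset, size) and both algorithms are assumptions made
    for illustration only.
    """
    portion = data[offset:offset + size]
    third = hashlib.sha1(portion).hexdigest()      # "third hash value"
    fourth = hashlib.sha256(portion).hexdigest()   # "fourth hash value"
    return third + fourth                          # "fifth hash value"


# Usage: the composite is 40 + 64 hexadecimal characters long.
fifth = composite_hash(b"example file contents")
```

Because the leading characters of the composite are exactly the first digest, a stored composite can still be compared against a single-algorithm hash by looking at that prefix alone.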

31. A system, comprising:
- a memory configured to store data associated with a hostname and an address, the memory configured to be in data communication with a processor; and
  
  a logic module configured:
  
  to evaluate an address associated with a hostname,
  
  to compare the address to the collection to determine if the address is stored in the collection,
  
  to process the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection,
  
  to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file,
  
  to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value,
  
  to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file, and
  
  to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file.
- Dependent claims (32, 33, 34, 35, 36):
  - 32. The system of claim 31, wherein the logic module is further configured to download a file associated with the address.
  - 33. The system of claim 31, wherein the logic module is further configured to store the address if the address is not associated with the hostname.
  - 34. The system of claim 31, wherein the memory is further configured to store the address and the second hash value.
  - 35. The system of claim 31, wherein the logic module is further configured to determine whether the second hash value matches a stored hash value.
  - 36. The system of claim 35, wherein the logic module is further configured to identify an address of another file that corresponds to the stored hash value when the second hash value matches the stored hash value.
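
Claims 35 and 36 use a stored hash value to find where another copy of the same content lives. One way to picture this is a mapping from hash value to the addresses recorded under it, sketched below in Python. The class `LocationStore`, its method names, and the sample URLs are hypothetical, and matching is by exact equality only.

```python
from collections import defaultdict


class LocationStore:
    """Hypothetical store mapping a hash value to known addresses."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def store(self, hash_value: str, address: str) -> None:
        # Record the address once per hash value.
        if address not in self._by_hash[hash_value]:
            self._by_hash[hash_value].append(address)

    def locate(self, hash_value: str, exclude: str = None) -> list:
        """Return addresses of other files stored under a matching hash."""
        # .get() avoids creating an empty entry for an unknown hash value.
        return [a for a in self._by_hash.get(hash_value, []) if a != exclude]


# Usage with placeholder hash values and addresses.
store = LocationStore()
store.store("abc123", "http://host-a.example/track.mp3")
store.store("abc123", "http://host-b.example/mirror.mp3")
others = store.locate("abc123", exclude="http://host-a.example/track.mp3")
```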

37. A system, comprising:
- one or more databases configured to store data associated with one or more addresses;
  
  a crawler instance configured to crawl the one or more addresses, wherein the crawler is registered with one or more databases; and
  
  a distributed processor network configured:
  
  to evaluate an address associated with a hostname,
  
  to compare the address to another address stored in a collection,
  
  to process the address to download a first file identified by the address if the address is not stored in the collection,
  
  to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data is identified based on a size and a location associated with the first file,
  
  to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value,
  
  to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file, and
  
  to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file.
- Dependent claims (38, 39):
  - 38. The system of claim 37, wherein the distributed processor network is further configured to determine whether the second hash value matches a stored hash value.
  - 39. The system of claim 38, wherein the distributed processor network is further configured to identify an address of another file that corresponds to the stored hash value when the second hash value matches the stored hash value.

Current Assignee: Concert Technology Corporation
Original Assignee: Pinehill Technology, LLC (Concert Technology Corporation)
Inventors: Kaminski, Charles Jr.
Primary Examiner(s): Colan, Giovanna
Application Number: US11/732,834
Time in Patent Office: 1,853 Days
Field of Search: 707/10, 707/101, 707/3, 707/102, 707/698, 707/747, 707/741, 709/225, 726/23, 711/216
US Class Current: 707/698
CPC Class Codes:
G06F 16/50   of still image data
G06F 16/532   Query formulation, e.g. gra...
G06F 16/951   Indexing; Web crawling tech...
