Use of hash values for identification and location of content
Abstract
Surrogate hashing is described, including initializing one or more variables in a collection, evaluating an address associated with a host, comparing the address to the collection to determine if the address is stored in the collection, and processing the address to hash a file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection.
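The branching described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `process_address`, the dictionary used as the collection, the `fetch` and `find_links` callables, and the choice of SHA-1 are all hypothetical and are not taken from the patent.

```python
import hashlib


def process_address(address, collection, fetch, find_links):
    """Handle one address against a collection of already-seen addresses.

    Assumptions (not from the source): `collection` is a dict mapping
    address -> hex digest, `fetch(address)` returns the file bytes, and
    `find_links(address)` returns the addresses indicated by `address`.
    """
    if address not in collection:
        # Address not yet stored: hash the file it identifies and record it.
        data = fetch(address)
        collection[address] = hashlib.sha1(data).hexdigest()
        return []
    # Address already stored: determine whether it indicates other addresses
    # that have not been seen yet.
    return [a for a in find_links(address) if a not in collection]
```

Calling the function twice on the same address exercises both branches: the first call hashes the file, and the second returns any not-yet-seen addresses it points to.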
139 Citations
39 Claims
1. A method, comprising:
retrieving a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
detecting the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value; and
storing the second hash value and the address associated with the first file.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 17, 18, 19
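The two-stage hashing recited in claim 1 might look something like the sketch below. It is a simplified illustration, not the patented implementation: the fixed offset and size that define the standardized portion, the use of SHA-1 and SHA-256 as the first and second algorithms, the helper names, and the in-memory dictionaries are all assumptions. "Substantially similar" is not modeled; only exact equality of the first hash is checked, and recording an unmatched first hash for later comparison is likewise an assumption.

```python
import hashlib

# Hypothetical definition of the "standardized portion": the same byte
# range is taken from every file so hashes are comparable across files.
PORTION_OFFSET = 0
PORTION_SIZE = 4096


def standardized_portion(data, offset=PORTION_OFFSET, size=PORTION_SIZE):
    """Select the same slice (by location and size) from any file."""
    return data[offset:offset + size]


def index_file(address, data, first_index, records):
    """Hash the standardized portion; escalate to a second hash on a match.

    `first_index` maps first-hash -> address of an earlier file.
    `records` maps address -> second hash (stored only after a match).
    Returns (first_hash, second_hash_or_None).
    """
    portion = standardized_portion(data)
    first = hashlib.sha1(portion).hexdigest()
    if first in first_index:
        # First hash matches another file's: run a different algorithm
        # over the same portion and store the result with the address.
        second = hashlib.sha256(portion).hexdigest()
        records[address] = second
        return first, second
    # No match yet: remember the first hash for later comparisons.
    first_index[first] = address
    return first, None
```

Hashing only a fixed slice keeps the first pass cheap; the second algorithm runs only when the first pass flags a candidate match.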
12. A method for file identification, comprising:
evaluating an address associated with a hostname;
comparing the address to the collection to determine if the address is stored in the collection;
processing the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection;
identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file; and
in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value.
Dependent claims: 13, 14
15. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for:
retrieving a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
detecting the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
identifying a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
running a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
determining whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, running a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value; and
storing the address associated with the first file, the first hash value, and the second hash value.
20. A system, comprising:
at least one processor and memory configured to:
evaluate an address associated with a hostname;
compare the address to the collection to determine if the address is stored in the collection;
process the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection;
identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
run a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file; and
in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file, run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value.
21. A system, comprising:
a memory configured to be in data communication with a processor; and
a processor configured:
to retrieve a hostname, the hostname being evaluated to determine if an address is associated with the hostname;
to detect the address, the address being further processed to download a first file identified by the address if the first file is associated with the hostname or storing the address if the address points to another hostname;
to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file;
to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value;
to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file;
to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file; and
to store the second hash value and the address associated with the first file.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30
31. A system, comprising:
a memory configured to store data associated with a hostname and an address, the memory configured to be in data communication with a processor; and
a logic module configured:
to evaluate an address associated with a hostname,
to compare the address to the collection to determine if the address is stored in the collection,
to process the address to download a first file identified by the address if the address is not stored in the collection or determining if another address is indicated by the address if the address is stored in the collection,
to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data set is identified based on a size and a location associated with the first file,
to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value,
to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file, and
to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file.
Dependent claims: 32, 33, 34, 35, 36
37. A system, comprising:
one or more databases configured to store data associated with one or more addresses;
a crawler instance configured to crawl the one or more addresses, wherein the crawler is registered with one or more databases; and
a distributed processor network configured:
to evaluate an address associated with a hostname,
to compare the address to another address stored in a collection,
to process the address to download a first file identified by the address if the address is not stored in the collection,
to identify a standardized portion of the first file by identifying a data set to be selected consistently from the first file, wherein the data is identified based on a size and a location associated with the first file,
to run a first hashing algorithm against the standardized portion of the first file to generate a first hash value,
to determine whether the first hash value is the same as or substantially similar to another hash value associated with a standardized portion of a second file, and
to run a second hashing algorithm different from the first hashing algorithm against the standardized portion of the first file to generate a second hash value different from the first hash value in response to determining that the first hash value is the same as or substantially similar to the hash value associated with the standardized portion of the second file.
Dependent claims: 38, 39
Specification