Grouping and differentiating files based on content
Abstract
Methods and apparatus teach a digital spectrum of a file. The digital spectrum is used to map a file's position. This position relative to another file's position reveals distances between the files. Representatively, files have a plurality of symbols representing an underlying data stream of original bits of data. The number of occurrences of each symbol in each file is compared to like symbols in other files. This can occur via algorithms, mapping, or both. In certain instances, comparison reveals a difference in counts between the symbols of the files. This difference is then squared, added together, and a square root taken. Comparing “distance values” reveals file adjacency, grouping, or the like. Also, normalizing, weighting, filtering functions and/or other statistical computations are applied in certain instances.
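The computation the abstract describes (per-symbol count differences, squared, summed, then square-rooted) is a Euclidean distance over symbol counts. The following is a minimal sketch of that arithmetic only, not the patented encoding: it assumes, purely for illustration, that each byte of a file is one symbol, and the names `symbol_counts` and `spectrum_distance` are hypothetical.

```python
import math
from collections import Counter


def symbol_counts(data: bytes) -> Counter:
    """Count occurrences of each symbol.

    Assumption for illustration: one byte equals one symbol. The source
    does not state this; it only says symbols represent the underlying bits.
    """
    return Counter(data)


def spectrum_distance(a: bytes, b: bytes) -> float:
    """Square the per-symbol count differences, add them, take the square root."""
    counts_a = symbol_counts(a)
    counts_b = symbol_counts(b)
    # A Counter returns 0 for symbols it has not seen, so symbols present
    # in only one file still contribute to the sum.
    symbols = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[s] - counts_b[s]) ** 2 for s in symbols))
```

Under this assumption, two files containing the same bytes in a different order have distance 0, because only the counts are compared, not the positions of the symbols.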
19 Claims
1. In a computing system environment, a method of differentiating files stored on one or more computing devices, each file having a plurality of symbols derived from an underlying data stream of all original bits of raw data of said each file, comprising:
- encoding said each file as a plurality of symbols representing an underlying data stream of all original bits of binary data of the file;
- determining a number of occurrences of each said symbol in said each file; and
- computing a distance between said each file and every other file based on the determined number of occurrences.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. In a computing system environment, a method of differentiating files stored on one or more computing devices, each file having a plurality of symbols derived from an underlying data stream of all original bits of raw data of said each file, comprising:
- encoding said each file as a plurality of symbols representing an underlying data stream of all original bits of binary data of the file;
- determining a number of occurrences of each said symbol in said each file; and
- computing a distance in an N-dimensional space between said each file and every other file based on said determined number of occurrences, said N-dimensional space being defined by a total number N of different symbols in the plurality of symbols.
15. In a computing system environment, a method of determining closest files stored on one or more computing devices, each file having a plurality of symbols derived from an underlying data stream of all original bits of raw data of said each file, comprising:
- encoding said each file as a plurality of symbols representing an underlying data stream of all original bits of binary data of the file;
- determining a number of occurrences of each said symbol in said each file;
- computing a distance value between said each file and every other file based on said number of occurrences; and
- concluding a closest two files based on the computed distance value.
Dependent claims: 16, 17, 18
19. In a computing system environment, a method of differentiating files stored on one or more computing devices, each file having a plurality of symbols derived from an underlying data stream of all original bits of raw data of said each file, comprising:
- encoding said each file as a plurality of symbols representing an underlying data stream of all original bits of binary data of the file;
- defining an N-dimensional space wherein N is a total number of different symbols in the plurality of symbols;
- determining a number of occurrences of each said symbol in said each file; and
- computing a distance in the N-dimensional space between said each file and every other file of the stored files based on the determined number of occurrences.
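Read together, the independent claims place each file at a point in an N-dimensional space (one axis per distinct symbol), compute the distance between every pair of files, and pick the closest two. A rough sketch of that flow follows. It again assumes one byte per symbol, which the claims do not specify, and the function names and file names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_vector(data: bytes, alphabet: list[int]) -> list[int]:
    """Place a file in N-dimensional space: one coordinate per symbol in `alphabet`."""
    counts = Counter(data)
    return [counts[symbol] for symbol in alphabet]


def pairwise_distances(files: dict[str, bytes]) -> dict[tuple[str, str], float]:
    """Distance between each file and every other file.

    Assumption: a symbol is one byte, so N is the number of distinct
    byte values seen across all files.
    """
    alphabet = sorted(set().union(*files.values()))
    vectors = {name: count_vector(data, alphabet) for name, data in files.items()}
    return {
        (a, b): math.dist(vectors[a], vectors[b])  # Euclidean distance, Python 3.8+
        for a, b in combinations(vectors, 2)
    }


def closest_pair(files: dict[str, bytes]) -> tuple[str, str]:
    """Return the names of the two files with the smallest distance value."""
    distances = pairwise_distances(files)
    if not distances:
        raise ValueError("need at least two files to compare")
    return min(distances, key=distances.get)


# Hypothetical example data, not taken from the patent.
example_files = {"a.txt": b"aaab", "b.txt": b"aabb", "c.txt": b"zzzz"}
nearest = closest_pair(example_files)
```

Comparing every pair costs a number of distance computations that grows quadratically with the number of files, which is acceptable for a sketch but would need attention for large collections.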
Specification