Method of identifying data type and locating in a file
First Claim
1. A method of identifying the type of data contained in an electronic file of unknown data type, comprising the steps:
a) gathering at least one exemplary file of each data type of interest;
b) counting the number of unique n-grams within each at least one exemplary file, where n is a user-definable range of integers;
c) determining a weight for each unique n-gram for each data type of interest as follows:
$$w_{ti} \equiv \log\left(K\,\frac{x_{ti}}{N_t} + 1\right) \times \left(\frac{1}{\log N_t}\sum_{j=1}^{N_t} p_{tji}\log\frac{1}{p_{tji}}\right) \times \left(1 - \frac{1}{\log L}\sum_{t=1}^{L} p_{ti}\log\frac{1}{p_{ti}}\right)^{d},$$
where $w_{ti}$ is the weight of a particular n-gram i for a particular data type t;
where $K$ is a constant that scales the term $(x_{ti}/N_t)$ away from unity;
where $N_t$ is the number of documents of type t;
where $x_{tji}$ is the number of occurrences of an n-gram i in exemplary document j of type t divided by the total number of bytes in document j;
where $x_{ti} = \sum_{j=1}^{N_t} x_{tji}$;
where $p_{tji} \equiv x_{tji}/x_{ti}$;
where $\sum p_{tji}\log(1/p_{tji})$ is summed from j=1 to j=$N_t$;
where $L$ is the number of data types of interest;
where $x_i = \sum_{t=1}^{L} x_{ti}$;
where $p_{ti} \equiv x_{ti}/x_i$;
where $\sum p_{ti}\log(1/p_{ti})$ is summed from t=1 to t=$L$; and
where $d$ is a positive real number that controls the concavity of the term that contains d;
d) listing the unique n-grams in the at least one exemplary file of a particular data type of interest in order of descending magnitude of weight for each data type of interest;
e) selecting the top m weighted n-grams and their associated weights from each list generated in the last step, where m is a user-definable integer;
f) establishing a user-definable threshold for each data type of interest for determining data type;
g) selecting a user-definable length of data from the electronic file of unknown data type;
h) listing every n-gram in the data selected in the last step, where n has the same range of integer values as in step (b);
i) giving each n-gram listed in the last step that was also selected in step (e) the weight that that n-gram was given in step (c) for each data type of interest;
j) summing the weights given to each n-gram in the last step according to data type;
k) comparing the sums in the last step to the thresholds established in step (f) in order to determine the data types, if any, of the selected data;
l) recording the location of the selected data if it is determined to be of any data type of interest;
m) stopping if the number of selected lengths of data reached a user-definable number of selected lengths of data, otherwise selecting another length of data from the file that is the same length as the data selected previously, where the data selected in this step overlaps with the previously selected data by at least one position; and
n) repeating step (h) through step (m) using the data selected in the last step.
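As an illustration only, steps (a) through (e) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names (`ngram_freqs`, `ngram_weights`, `top_m`), the default values of `K`, `d` and `n_values`, and the use of the natural logarithm are all hypothetical choices, since the claim does not fix them. The claim's formula is also undefined when a type has a single exemplar (log N_t = 0) or when there is only one data type (log L = 0); the sketch substitutes 1 and 0 respectively for the affected terms, which is an assumption.

```python
import math
from collections import defaultdict


def ngram_freqs(data, n_values):
    """x_tji: occurrences of each n-gram divided by the file's byte length."""
    counts = defaultdict(int)
    for n in n_values:
        for k in range(len(data) - n + 1):
            counts[data[k:k + n]] += 1
    return {g: c / len(data) for g, c in counts.items()}


def entropy(probs):
    """Sum of p * log(1/p), skipping zero terms."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


def ngram_weights(exemplars, n_values=(2,), K=1000.0, d=1.0):
    """Map each data type to {n-gram: w_ti}; exemplars maps type -> list of bytes."""
    per_doc = {t: [ngram_freqs(doc, n_values) for doc in docs]
               for t, docs in exemplars.items()}
    x_t = {}                      # x_ti, summed over the documents of type t
    x_all = defaultdict(float)    # x_i, summed over all types
    for t, docs in per_doc.items():
        totals = defaultdict(float)
        for freqs in docs:
            for g, x in freqs.items():
                totals[g] += x
                x_all[g] += x
        x_t[t] = totals
    L = len(exemplars)
    weights = {}
    for t, docs in per_doc.items():
        N_t = len(docs)
        weights[t] = {}
        for g, x_ti in x_t[t].items():
            frequency = math.log(K * x_ti / N_t + 1)
            # Assumption: with one exemplar, log N_t = 0, so the term is set to 1.
            spread = 1.0 if N_t < 2 else entropy(
                f.get(g, 0.0) / x_ti for f in docs) / math.log(N_t)
            # Assumption: with one data type, log L = 0, so the entropy is set to 0.
            shared = 0.0 if L < 2 else entropy(
                x_t[u].get(g, 0.0) / x_all[g] for u in x_t) / math.log(L)
            # Clamp guards against a slightly negative base from rounding.
            weights[t][g] = frequency * spread * max(0.0, 1.0 - shared) ** d
    return weights


def top_m(weights, m):
    """Steps (d) and (e): keep the m heaviest n-grams for each data type."""
    return {t: dict(sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:m])
            for t, w in weights.items()}
```

Under this reading, an n-gram that appears evenly across the exemplars of one type and in no other type receives the largest weight, while an n-gram spread evenly across all types receives a weight of zero.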
Abstract
A method of identifying the types of data contained in an electronic file of unknown data type by gathering exemplary files of each data type of interest; counting the number of unique n-grams within each exemplary file; determining a weight for each unique n-gram; listing the unique n-grams in the exemplary files of a particular data type by descending magnitude of weight for each data type of interest; selecting the top m weighted n-grams and their associated weights; establishing a threshold for each data type of interest; selecting a length of data from the electronic file; listing every n-gram in the data selected; giving each listed n-gram, that was also selected, the weight that that n-gram was given for each data type of interest; summing the weights given to each n-gram according to data type; comparing the sums to the thresholds established in order to determine the types, if any, of the selected data; recording the location of the selected data if it is of a data type of interest; stopping if the number of selected lengths of data reached a user-definable number, otherwise selecting another length of data from the file that is the same length as that selected previously, where the newly selected data overlaps with the previously selected data by at least one position; and repeating the steps from listing every n-gram to stopping using the newly selected data.
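The scanning half of the method (selecting a length of data, scoring its n-grams, comparing against thresholds, recording the location, and moving to an overlapping window) could be sketched as below. This is a hedged illustration, not the patent's implementation: the function name `scan`, its parameters, and the example weight table are hypothetical. Two readings are assumed because the text leaves them open: every occurrence of a selected n-gram contributes its weight to the sum, and a window counts as a match when its sum is greater than or equal to the threshold.

```python
def scan(data, top_weights, thresholds, window=64, step=1,
         n_values=(2,), max_windows=None):
    """Return (offset, data_type, score) for each window that meets a threshold.

    top_weights maps data type -> {n-gram: weight}; thresholds maps
    data type -> minimum score. Consecutive windows must overlap, so
    step has to be smaller than window.
    """
    if not 1 <= step < window:
        raise ValueError("step must satisfy 1 <= step < window")
    hits = []
    offset = 0
    examined = 0
    while offset + window <= len(data):
        # Stop once the user-defined number of windows has been examined.
        if max_windows is not None and examined >= max_windows:
            break
        chunk = data[offset:offset + window]
        scores = {t: 0.0 for t in top_weights}
        # List every n-gram in the window and add the weight of each
        # one that appears in a type's selected table.
        for n in n_values:
            for k in range(len(chunk) - n + 1):
                gram = chunk[k:k + n]
                for t, table in top_weights.items():
                    w = table.get(gram)
                    if w is not None:
                        scores[t] += w
        # Record the location for every type whose threshold is met.
        for t, s in scores.items():
            if s >= thresholds[t]:
                hits.append((offset, t, s))
        examined += 1
        offset += step
    return hits


# Hypothetical weights and threshold, chosen only to exercise the function.
sample = b"\x00" * 8 + b"abababab" + b"\x00" * 8
table = {"text": {b"ab": 2.0, b"ba": 1.0}}
hits = scan(sample, table, {"text": 8.0}, window=8, step=4)
```

A smaller step locates type boundaries more precisely at the cost of scoring more windows.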
17 Claims
9. The method of step 1, wherein said step of stopping if the number of selected lengths of data reached a user-definable number of selected lengths of data is comprised of stopping if the end of the electronic file is reached, otherwise selecting another length of data that is one byte position to the right of the previously selected length of data.
16. The method of step 15, wherein said step of stopping if the number of selected lengths of data reached a user-definable number of selected lengths of data is comprised of stopping if the end of the electronic file is reached, otherwise selecting another length of data that is one byte position to the right of the previously selected length of data.
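The stopping rule in claims 9 and 16 amounts to sliding the window one byte at a time until the file is exhausted. A minimal sketch, assuming that "the end of the electronic file is reached" means the next full-length window would extend past the last byte (the helper name `window_offsets` is hypothetical):

```python
def window_offsets(file_len, window):
    """Yield each window start, advancing one byte position to the right
    until a full-length window would no longer fit in the file."""
    offset = 0
    while offset + window <= file_len:
        yield offset
        offset += 1
```

For a 10-byte file and an 8-byte window this yields the offsets 0, 1 and 2, so each window overlaps its predecessor by all but one byte.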
Specification