
Method of identifying data type and locating in a file

  • US 5,991,714 A
  • Filed: 04/22/1998
  • Issued: 11/23/1999
  • Est. Priority Date: 04/22/1998
  • Status: Expired due to Term
First Claim

1. A method of identifying the type of data contained in an electronic file of unknown data type, comprising the steps:

    a) gathering at least one exemplary file of each data type of interest;

    b) counting the number of unique n-grams within each at least one exemplary file, where n is a user-definable range of integers;

    c) determining a weight for each unique n-gram for each data type of interest as follows;

    $$w_{ti} \equiv \log\!\left(K\,\frac{x_{ti}}{N_t} + 1\right) \times \left(\frac{1}{\log N_t}\sum_{j=1}^{N_t} p_{tji}\,\log\frac{1}{p_{tji}}\right) \times \left(1 - \frac{1}{\log L}\sum_{t=1}^{L} p_{ti}\,\log\frac{1}{p_{ti}}\right)^{d},$$

    where $w_{ti}$ is the weight of a particular n-gram i for a particular data type t;

    where K is a constant that scales the term $(x_{ti}/N_t)$ away from unity;

    where $N_t$ is the number of documents of type t;

    where $x_{tji}$ is the number of occurrences of an n-gram i in exemplary document j of type t divided by the total number of bytes in document j;

    where $x_{ti} = \sum_{j=1}^{N_t} x_{tji}$, that is, $x_{tji}$ summed from j=1 to j=$N_t$;

    where $p_{tji} \equiv x_{tji}/x_{ti}$;

    where $\sum p_{tji}\log(1/p_{tji})$ is summed from j=1 to j=$N_t$;

    where L is the number of data types of interest;

    where $x_i = \sum_{t=1}^{L} x_{ti}$, that is, $x_{ti}$ summed from t=1 to t=L;

    where $p_{ti} \equiv x_{ti}/x_i$;

    where $\sum p_{ti}\log(1/p_{ti})$ is summed from t=1 to t=L; and

    where d is a positive real number that controls the concavity of the term that contains d;

    d) listing the unique n-grams in the at least one exemplary file of a particular data type of interest in order of descending magnitude of weight for each data type of interest;

    e) selecting the top m weighted n-grams and their associated weights from each list generated in the last step, where m is a user-definable integer;

    f) establishing a user-definable threshold for each data type of interest for determining data type;

    g) selecting a user-definable length of data from the electronic file of unknown data type;

    h) listing every n-gram in the data selected in the last step, where n has the same range of integer values as in step (b);

    i) giving each n-gram listed in the last step that was also selected in step (e) the weight that that n-gram was given in step (c) for each data type of interest;

    j) summing the weights given to each n-gram in the last step according to data type;

    k) comparing the sums in the last step to the thresholds established in step (f) in order to determine the data types, if any, of the selected data;

    l) recording the location of the selected data if it is determined to be of any data type of interest;

    m) stopping if the number of selected lengths of data reached a user-definable number of selected lengths of data, otherwise selecting another length of data from the file that is the same length as the data selected previously, where the data selected in this step overlaps with the previously selected data by at least one position; and

    n) repeating step (h) through step (m) using the data selected in the last step.
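
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patentee's implementation, and it rests on several assumptions that the claim does not state: the function names (`ngrams`, `entropy`, `train_weights`, `scan`), the default values of `K`, `d`, `m`, `window`, and `step`, the use of natural logarithms, the convention that a zero-probability term contributes nothing to a sum, and the reading of step (k) as "sum greater than or equal to threshold". The claim's formula divides by log N_t and log L, which are zero when there is a single exemplary file or a single data type; the sketch sidesteps this by setting the second factor to 1 and the normalized cross-type entropy to 0 in those cases, which is purely an illustrative choice.

```python
import math
from collections import Counter, defaultdict


def ngrams(data, n_values):
    """Yield every n-gram (as a bytes slice) in data for each n in n_values."""
    for n in n_values:
        for i in range(len(data) - n + 1):
            yield data[i:i + n]


def entropy(probs):
    """Sum of p * log(1/p), skipping zero terms (assumed convention)."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def train_weights(examples, n_values=(2,), K=1000.0, d=1.0, m=10):
    """Steps (a)-(e). examples maps a type name to a list of bytes objects.
    Returns {type: {ngram: weight}} holding the top m weights per type."""
    L = len(examples)
    x_tji = {}  # x_tji[t][gram][j]: count in document j divided by its byte length
    for t, docs in examples.items():
        per_gram = {}
        for j, doc in enumerate(docs):
            for gram, count in Counter(ngrams(doc, n_values)).items():
                per_gram.setdefault(gram, [0.0] * len(docs))[j] = count / len(doc)
        x_tji[t] = per_gram
    # x_ti: sum over documents j; x_i: sum over types t
    x_ti = {t: {g: sum(v) for g, v in grams.items()} for t, grams in x_tji.items()}
    x_i = defaultdict(float)
    for grams in x_ti.values():
        for g, v in grams.items():
            x_i[g] += v

    weights = {}
    for t, grams in x_ti.items():
        n_t = len(examples[t])
        table = {}
        for g, xti in grams.items():
            first = math.log(K * (xti / n_t) + 1.0)
            p_tj = [v / xti for v in x_tji[t][g]]
            # Assumption: factor is 1 when N_t == 1 (log N_t would be 0).
            second = entropy(p_tj) / math.log(n_t) if n_t > 1 else 1.0
            p_t = [x_ti[u].get(g, 0.0) / x_i[g] for u in x_ti]
            # Assumption: normalized entropy is 0 when L == 1 (log L would be 0).
            spread = entropy(p_t) / math.log(L) if L > 1 else 0.0
            third = max(0.0, 1.0 - spread) ** d  # clamp guards against rounding
            table[g] = first * second * third
        # Steps (d)-(e): order by descending weight and keep the top m.
        top = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:m]
        weights[t] = dict(top)
    return weights


def scan(data, weights, thresholds, n_values=(2,), window=64, step=1,
         max_windows=None):
    """Steps (g)-(n): score overlapping fixed-length windows and record
    (offset, type, score) wherever the score meets the type's threshold."""
    if step >= window:
        raise ValueError("successive windows must overlap, so step < window")
    hits = []
    for count, start in enumerate(range(0, len(data) - window + 1, step)):
        if max_windows is not None and count >= max_windows:
            break  # step (m): stop after a user-definable number of windows
        chunk = data[start:start + window]
        for t, table in weights.items():
            score = sum(table.get(g, 0.0) for g in ngrams(chunk, n_values))
            if score >= thresholds[t]:  # assumed direction of the comparison
                hits.append((start, t, score))
    return hits
```

A caller would pass a dictionary of exemplary files per type to `train_weights`, choose one threshold per type, and hand both to `scan` together with the bytes of the unknown file. Note that under the formula an n-gram that appears in only one exemplary file of a type has zero entropy across that type's documents and therefore receives a weight of zero.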
