×

Device and method for full-text large-dictionary string matching using n-gram hashing

  • US 6,169,969 B1
  • Filed: 02/10/1999
  • Issued: 01/02/2001
  • Est. Priority Date: 08/07/1998
  • Status: Expired due to Term
First Claim
Patent Images

1. A dictionary string matching method for locating all matches of a keyword dictionary in a sample byte stream, said keyword dictionary consisting of d keywords, each of said keywords being composed of a sequence of bytes of a general nature, comprising the steps of:

  • (a) initializing, comprising the steps of building and clearing m hash tables, said m hash tables being composed of presence values, said presence values being binary entries of logical value, said clearing consisting of the setting the value of each of said binary entries to false;

    (b) thereafter, for each distinct length of keywords in the dictionary, choosing m n-gram selection positions;

    (c) thereafter, for each integer i taking values of 1 through d, inclusive, performing the steps of;

    (i) determining the length of the ith of the keywords;

    (ii) extracting m keyword n-grams from the ith keyword beginning at said m n-gram selection positions in the keyword, respectively;

    (iii) hashing each of said keyword n-grams, producing m keyword hash addresses, each of said keyword hash addresses corresponding to one of said keyword n-grams;

    said hashing being performed using m hashing functions; and

    (iv) recording, comprising the step of posting said m keyword n-grams in said m hash tables, performed by registering each of said m keyword n-grams in the corresponding one of said m hash tables, said registering within one of said m hash tables being performed by setting one of said presence values therein to the value of true, at the address identified by the corresponding one of said m keyword hash addresses;

    (d) thereafter, creating m presence values streams by performing, for each position in the sample byte stream, n, n+1, n+2, . . . , in order, the steps of;

    (i) extracting from the sample byte stream the sample n-gram consisting of the n consecutive bytes terminating in the byte at said position;

    (ii) computing m sample hash addresses for said sample n-gram, the jth of said sample hash addresses calculated by applying to said sample n-gram the jth of said m hashing functions;

    (iii) reading m sample presence values from said m hash tables, the jth of said m sample presence values obtained by reading the value from the jth of said m hash tables at the address identified by the jth of said m sample hash addresses; and

    (iv) appending said m sample presence values to said m presence values streams by appending the jth of said m sample presence values to the jth of said m presence values streams;

    (e) generating primary alarms, by performing, for each distinct length of the keywords in the dictionary, the steps of;

    (i) calculating m delaying amounts, the jth of said m delaying amounts being additively complementary to the jth of said m n-gram selection positions for said distinct length of the keywords in the dictionary;

    (i) forming m delayed presence values streams according to said m delaying amounts, the jth of said m delayed presence values streams being produced by delaying the jth of said m presence values streams by an amount equal to the jth of said m delaying amounts; and

    (ii) applying an alarm sensing method to said m delayed presence values streams, producing one of said primary alarms only when said alarm sensing method senses the coincidence of true values in each of said m delayed presence values streams; and

    (f) for each of said primary alarms, applying a secondary testing method to verify said primary alarm, resulting in a match report if verification holds.

View all claims
  • 1 Assignment
Timeline View
Assignment View
    ×
    ×