Device and method for full-text large-dictionary string matching using n-gram hashing
First Claim
1. A dictionary string matching method for locating all matches of a keyword dictionary in a sample byte stream, said keyword dictionary consisting of d keywords, each of said keywords being composed of a sequence of bytes of a general nature, comprising the steps of:
- (a) initializing, comprising the steps of building and clearing m hash tables, said m hash tables being composed of presence values, said presence values being binary entries of logical value, said clearing consisting of the setting the value of each of said binary entries to false;
(b) thereafter, for each distinct length of keywords in the dictionary, choosing m n-gram selection positions;
(c) thereafter, for each integer i taking values of 1 through d, inclusive, performing the steps of;
(i) determining the length of the ith of the keywords;
(ii) extracting m keyword n-grams from the ith keyword beginning at said m n-gram selection positions in the keyword, respectively;
(iii) hashing each of said keyword n-grams, producing m keyword hash addresses, each of said keyword hash addresses corresponding to one of said keyword n-grams;
said hashing being performed using m hashing functions; and
(iv) recording, comprising the step of posting said m keyword n-grams in said m hash tables, performed by registering each of said m keyword n-grams in the corresponding one of said m hash tables, said registering within one of said m hash tables being performed by setting one of said presence values therein to the value of true, at the address identified by the corresponding one of said m keyword hash addresses;
(d) thereafter, creating m presence values streams by performing, for each position in the sample byte stream, n, n+1, n+2, . . . , in order, the steps of;
(i) extracting from the sample byte stream the sample n-gram consisting of the n consecutive bytes terminating in the byte at said position;
(ii) computing m sample hash addresses for said sample n-gram, the jth of said sample hash addresses calculated by applying to said sample n-gram the jth of said m hashing functions;
(iii) reading m sample presence values from said m hash tables, the jth of said m sample presence values obtained by reading the value from the jth of said m hash tables at the address identified by the jth of said m sample hash addresses; and
(iv) appending said m sample presence values to said m presence values streams by appending the jth of said m sample presence values to the jth of said m presence values streams;
(e) generating primary alarms, by performing, for each distinct length of the keywords in the dictionary, the steps of;
(i) calculating m delaying amounts, the jth of said m delaying amounts being additively complementary to the jth of said m n-gram selection positions for said distinct length of the keywords in the dictionary;
(i) forming m delayed presence values streams according to said m delaying amounts, the jth of said m delayed presence values streams being produced by delaying the jth of said m presence values streams by an amount equal to the jth of said m delaying amounts; and
(ii) applying an alarm sensing method to said m delayed presence values streams, producing one of said primary alarms only when said alarm sensing method senses the coincidence of true values in each of said m delayed presence values streams; and
(f) for each of said primary alarms, applying a secondary testing method to verify said primary alarm, resulting in a match report if verification holds.
1 Assignment
0 Petitions
Accused Products
Abstract
A method and apparatus providing full-text scanning for matches in a large dictionary is described. The invention is suitable for SDI (selective dissemination of information) systems, accommodating large dictionaries (104 to 105 entries) and rapid processing. A preferred embodiment employs a hardware primary test on a single commercially-available gate-array board hosted by a computer, in which a software secondary test is conducted. No delimiter cues such as spaces or punctuation are required.
261 Citations
8 Claims
-
1. A dictionary string matching method for locating all matches of a keyword dictionary in a sample byte stream, said keyword dictionary consisting of d keywords, each of said keywords being composed of a sequence of bytes of a general nature, comprising the steps of:
-
(a) initializing, comprising the steps of building and clearing m hash tables, said m hash tables being composed of presence values, said presence values being binary entries of logical value, said clearing consisting of the setting the value of each of said binary entries to false;
(b) thereafter, for each distinct length of keywords in the dictionary, choosing m n-gram selection positions;
(c) thereafter, for each integer i taking values of 1 through d, inclusive, performing the steps of;
(i) determining the length of the ith of the keywords;
(ii) extracting m keyword n-grams from the ith keyword beginning at said m n-gram selection positions in the keyword, respectively;
(iii) hashing each of said keyword n-grams, producing m keyword hash addresses, each of said keyword hash addresses corresponding to one of said keyword n-grams;
said hashing being performed using m hashing functions; and
(iv) recording, comprising the step of posting said m keyword n-grams in said m hash tables, performed by registering each of said m keyword n-grams in the corresponding one of said m hash tables, said registering within one of said m hash tables being performed by setting one of said presence values therein to the value of true, at the address identified by the corresponding one of said m keyword hash addresses;
(d) thereafter, creating m presence values streams by performing, for each position in the sample byte stream, n, n+1, n+2, . . . , in order, the steps of;
(i) extracting from the sample byte stream the sample n-gram consisting of the n consecutive bytes terminating in the byte at said position;
(ii) computing m sample hash addresses for said sample n-gram, the jth of said sample hash addresses calculated by applying to said sample n-gram the jth of said m hashing functions;
(iii) reading m sample presence values from said m hash tables, the jth of said m sample presence values obtained by reading the value from the jth of said m hash tables at the address identified by the jth of said m sample hash addresses; and
(iv) appending said m sample presence values to said m presence values streams by appending the jth of said m sample presence values to the jth of said m presence values streams;
(e) generating primary alarms, by performing, for each distinct length of the keywords in the dictionary, the steps of;
(i) calculating m delaying amounts, the jth of said m delaying amounts being additively complementary to the jth of said m n-gram selection positions for said distinct length of the keywords in the dictionary;
(i) forming m delayed presence values streams according to said m delaying amounts, the jth of said m delayed presence values streams being produced by delaying the jth of said m presence values streams by an amount equal to the jth of said m delaying amounts; and
(ii) applying an alarm sensing method to said m delayed presence values streams, producing one of said primary alarms only when said alarm sensing method senses the coincidence of true values in each of said m delayed presence values streams; and
(f) for each of said primary alarms, applying a secondary testing method to verify said primary alarm, resulting in a match report if verification holds. - View Dependent Claims (2, 3)
(a) said step of initializing also includes the steps of;
(i) building and clearing a length-enabling table, said length-enabling table consisting of logical words; and
(ii) choosing a length-recording index from among the integers 1 through m, inclusive;
(b) said step of recording further comprises the additional step of posting information about the length of the ith of the keywords, said step of posting information about the length of the ith of the keywords accomplished by setting a bit in the logical word in said length-enabling table at the location identified by the one of said m keyword hash addresses corresponding to said length-recording index; and
(c) wherein said alarm sensing method consults said length-enabling table, said length-enabling table being addressed by the one of said sample hash addresses corresponding to said length-recording index.
-
-
3. A dictionary string matching method according to claim 1, wherein said step of computing m sample hash addresses is implemented in a recursive manner.
-
4. An apparatus for locating all matches of a keyword dictionary in a sample byte stream, comprising:
-
(a) at least one primary testing means, each of said primary testing means further comprising;
(i) a plurality m of n-gram hashers, each of which produces one of a plurality m of hash address streams from the sample byte stream, such that for each successive n-gram in the sample byte stream, one element in each of said plurality m of hash address streams is computed from the successive n-gram of the sample byte stream;
(ii) a plurality m of hash tables, containing binary entries, prepared in advance from the n-gram contents of the keywords in the keyword dictionary, and connected such that each of said plurality m of hash tables accepts one of said plurality m of hash address streams and produces one of a plurality m of binary presence streams;
(iii) for each distinct length of the keywords in the keyword dictionary, means of shifting each of said plurality m of binary presence streams producing a plurality m of shifted binary presence streams; and
(iv) alarm sensing means, which, for each distinct length of the keywords in the keyword dictionary, operates on said plurality m of shifted binary presence streams to produce primary alarms; and
(b) secondary testing means which, upon receiving one of said primary alarms, examines the sample byte stream for matches indicated by said one of said primary alarms, and issuing a match report if any of said matches are verified. - View Dependent Claims (5, 6, 7, 8)
(a) a plurality of AND gates, each of said plurality of AND gates producing a length-specific alarm signal, each of said plurality of AND gates corresponding to one of the distinct lengths of keywords in the keyword dictionary and driven by said plurality m of shifted binary presence streams corresponding to one of the distinct lengths of keywords in the keyword dictionary; and
(b) means for producing one of said primary alarms when one or more of said length-specific alarm signals indicates an alarm.
-
-
6. The apparatus for locating all matches of a keyword dictionary in a sample byte stream as recited in claim 5, wherein:
-
(a) a distinguished hash address stream is chosen from among one of said plurality m of hash address streams;
(b) said alarm sensing means further comprises a length-enabling table addressed by said distinguished hash address stream, producing a length-enabling word stream; and
(c) each one of said plurality of AND gates of said alarm sensing means is further driven by a bit stream derived from said length-enabling word stream.
-
-
7. The apparatus for locating all matches of a keyword dictionary in a sample byte stream as recited in claim 4, wherein each of said plurality of n-gram hashers is recursive, each producing an output that depends only upon its previous output, the current byte of the sample byte stream, and the byte n samples before in the sample byte stream.
-
8. The apparatus for locating all matches of a keyword dictionary in a sample byte stream as recited in claim 4, wherein a portion of the apparatus is implemented using gate arrays.
Specification