Regular expression acceleration engine and processing model
First Claim
1. A method of recognizing a lexeme in a data file comprising a plurality of symbols, the method comprising:
- generating one or more regular expression queries;
generating a deterministic finite automata (DFA) based on the regular expression queries;
executing the DFA on the data file, wherein the executing comprises identifying a first lexeme in the data file after evaluating one or more symbols of the data file;
storing in a storage device a location in the data file associated with a last symbol of the first lexeme;
evaluating one or more additional symbols of the data file;
determining if the first lexeme is a part of a second lexeme comprising the one or more additional symbols; and
if the first lexeme is not a part of the second lexeme, reporting the identification of the first lexeme and evaluating additional symbols starting with a symbol immediately following the stored location.
2 Assignments
0 Petitions
Accused Products
Abstract
Optimization for improved construction and execution of state machines configured to identify lexemes in data files is disclosed. This optimization includes, for example, systems and methods for disambiguating between overlapping matches found in data files, using trailing context regular expressions, removing stall states from state machines, selecting between a plurality of sets of regular expressions, analyzing multiple data files concurrently, analyzing portions of a single data file concurrently, representing state machines using instructions representative of transitions between states, and using virtual terminal instructions.
196 Citations
58 Claims
-
1. A method of recognizing a lexeme in a data file comprising a plurality of symbols, the method comprising:
-
generating one or more regular expression queries;
generating a deterministic finite automata (DFA) based on the regular expression queries;
executing the DFA on the data file, wherein the executing comprises identifying a first lexeme in the data file after evaluating one or more symbols of the data file;
storing in a storage device a location in the data file associated with a last symbol of the first lexeme;
evaluating one or more additional symbols of the data file;
determining if the first lexeme is a part of a second lexeme comprising the one or more additional symbols; and
if the first lexeme is not a part of the second lexeme, reporting the identification of the first lexeme and evaluating additional symbols starting with a symbol immediately following the stored location. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of recognizing a lexeme in a data file comprising a plurality of symbols, the method comprising:
-
generating a regular expression query including a lexeme and a trailing context, wherein each of the lexeme and the trailing context includes one or more symbols;
generating a deterministic finite automata (DFA) based on the regular expression query;
executing the DFA on the data file, wherein the executing comprises identifying the lexeme in the data file after evaluating one or more symbols of the data file;
storing in a storage device a trail head location indicating a position of the symbol immediately following the lexeme;
evaluating one or more additional symbols of the data file;
determining if the additional symbols match the trailing context; and
if the additional symbols match the trailing context, reporting the identification of the lexeme. - View Dependent Claims (8, 9, 10)
-
-
11. A compiler configured to generate a deterministic finite automata (DFA) based at least partly upon one or more regular expression queries, the compiler comprising:
-
means for determining one or more non-terminal states that occur logically after a non-terminal accepting state and before either of (1) a next non-terminal accepting state or (2) a terminal state; and
means for associating a state transition instruction of the non-terminal accepting state with each of the determined one or more non-terminal states. - View Dependent Claims (12)
-
-
13. A method of removing stall states from a state machine, the method comprising:
-
(a) identifying a non-terminal accepting state by searching one or more states downstream from an initial state, wherein a lexeme is associated with the non-terminal accepting state;
(b) identifying a non-terminal non-accepting state downstream from the identified non-terminal accepting state;
(c) associating information identifying the lexeme with the non-terminal non-accepting state; and
(d) repeating steps b and c until another non-terminal accepting state or a terminal state is reached. - View Dependent Claims (14)
-
-
15. A method of selecting one set of regular expression queries among a plurality of sets of regular expression queries, the method comprising:
-
storing a plurality of regular expression queries in a computing device;
receiving a data file comprising a plurality of symbols;
identifying a start condition value in the received data file; and
determining one set of regular expression queries that corresponds with the start condition. - View Dependent Claims (16, 17, 18)
-
-
19. A method of switching between sets of regular expression queries, the method comprising:
-
storing a plurality of sets of regular expression queries in a computing device;
receiving a data file comprising a plurality of symbols;
identifying a start condition value in the received data file;
determining a set of regular expression queries from the stored plurality of sets of regular expression queries that corresponds with the start condition;
analyzing one or more symbols of the data file according to the determined set of regular expression queries;
identifying, based on the one or more symbols of the data file, another set of regular expression queries; and
executing the identified another set of regular expression queries. - View Dependent Claims (20, 21, 22, 23, 24, 25)
-
-
26. A method of lexically analyzing a data file, the method comprising:
-
(a) providing a first rule set corresponding to a first set of regular expressions;
(b) identifying a first lexeme in the data file based at least partly upon the first rule set;
(c) based on the identified first lexeme, identifying a second rule set corresponding to a second set of regular expressions; and
(d) analyzing the data file according to the second rule set. - View Dependent Claims (27, 28)
-
-
29. A method of lexically analyzing a data file, the method comprising:
-
(a) providing a Nth rule set corresponding to a Nth set of regular expressions;
(b) identifying a Nth lexeme in the data file according to the Nth rule set;
(c) based on the identified first lexeme, identifying a N+1th rule set corresponding to a N+1th set of regular expressions;
(d) setting N equal to N+1; and
(e) repeating steps b-d.
-
-
30. A system for lexically analyzing a data file, the system comprising:
-
(a) means for providing a Nth rule set corresponding to a Nth set of regular expressions;
(b) means for identifying a Nth lexeme in the data file according to the Nth rule set;
(c) means for identifying a N+1th rule set corresponding to a N+1th set of regular expressions based on the identified first lexeme;
(d) means for setting N equal to N+1;
(e) means for repeating steps b-d.
-
-
31. A system for locating one or more tokens in a plurality of data files, each data file comprising a plurality of symbols, the system comprising:
-
a storage device for storing at least a portion of one or more regular expression queries;
a compiler configured to generate a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries, an execution engine configured to operate on the plurality of data files according to the DFA, wherein the execution engine is configured to process one symbol every M clock cycles; and
a multiplexer coupled to the execution engine and configured to receive symbols from at least M of the plurality of data files, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.
-
-
32. A method of locating one or more tokens in M data files, each data file comprising a plurality of symbols, the method comprising:
-
receiving one or more regular expression queries;
generating a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries; and
operating on the plurality of data files according to the DFA, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.
-
-
33. A system for locating one or more tokens in M data files, each data file comprising a plurality of symbols, the system comprising:
-
means for receiving one or more regular expression queries;
means for generating a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries; and
means for operating on the plurality of data files according to the DFA, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.
-
-
34. An apparatus for processing a single data file comprising a plurality of symbols, the apparatus comprising:
-
a segmenter configured to divide the file into M regions;
M storage locations each configured to buffer portions of one of the M regions;
a core execution unit configured to execute a state machine, wherein movement from a current state to a next state in the state machine requires M clock cycles, the core execution unit comprising a storage device for storing information indicating one or more boundaries between the M regions, wherein the core execution unit reads a symbol from one of the M storage locations during each clock cycle. - View Dependent Claims (35, 36, 37, 38, 39, 40, 41)
-
-
42. A method of representing a state machine, the method comprising:
-
(a) determining a number M of out transitions from a Nth state in the state machine;
(b) generating an instruction corresponding to each of the M transitions from the Nth state, wherein each of the instructions includes an indication of a next state in the state machine;
(c) repeating steps a and b for each of the states of the state machine; and
(d) storing at least some of the instructions for each of the states of the state machine in a storage device, wherein the indication of the next state in the one or more instructions is usable to determine an address of the next state in the storage device. - View Dependent Claims (43, 44, 45, 46, 47, 48, 49, 56)
-
-
50. A method of moving between a plurality of states of a state machine, wherein a plurality of instructions indicate transitions between states of the state machine, the method comprising:
-
selecting an instruction corresponding to a transition from a first state, wherein the act of selecting is based, at least partly, on one or more current symbol classes;
setting an offset according to one or more of the current symbol classes and one or more fields of the selected instruction;
determining an address of a next state by adding the offset to an address of the selected instruction. - View Dependent Claims (51, 52, 53, 54, 55)
-
-
57. A state machine comprising:
-
a plurality of instructions, each instruction representing a transition from one state to another state in a state machine; and
a virtual terminal instruction including (a) information indicating an output that corresponds to a state associated with the virtual terminal instruction and (b) information usable to determine a next state, wherein by executing the virtual terminal instruction, the state machine transitions from the state associated with the virtual terminal instruction to the determined next state in a single clock cycle. - View Dependent Claims (58)
-
Specification