Regular expression acceleration engine and processing model

US 20050273450A1
Filed: 05/21/2004
Published: 12/08/2005
Est. Priority Date: 05/21/2004
Status: Abandoned Application


Abstract

Optimization for improved construction and execution of state machines configured to identify lexemes in data files is disclosed. This optimization includes, for example, systems and methods for disambiguating between overlapping matches found in data files, using trailing context regular expressions, removing stall states from state machines, selecting between a plurality of sets of regular expressions, analyzing multiple data files concurrently, analyzing portions of a single data file concurrently, representing state machines using instructions representative of transitions between states, and using virtual terminal instructions.

196 Citations

58 Claims

1. A method of recognizing a lexeme in a data file comprising a plurality of symbols, the method comprising:
- generating one or more regular expression queries;
  
  generating a deterministic finite automata (DFA) based on the regular expression queries;
  
  executing the DFA on the data file, wherein the executing comprises identifying a first lexeme in the data file after evaluating one or more symbols of the data file;
  
  storing in a storage device a location in the data file associated with a last symbol of the first lexeme;
  
  evaluating one or more additional symbols of the data file;
  
  determining if the first lexeme is a part of a second lexeme comprising the one or more additional symbols; and
  
  if the first lexeme is not a part of the second lexeme, reporting the identification of the first lexeme and evaluating additional symbols starting with a symbol immediately following the stored location.
  - 2. The method of claim 1, further comprising storing in another storage device a last accepting state.
  - 3. The method of claim 2, wherein the last accepting state comprises information related to contents of an instruction pointer associated with the step of identifying the first lexeme.
  - 4. The method of claim 1, further comprising:
    - if the first lexeme is a part of the second lexeme, reporting the identification of the first lexeme and the second lexeme.
  - 5. The method of claim 1, further comprising:
    - if the first lexeme is a part of the second lexeme, reporting the identification of the second lexeme.
  - 6. The method of claim 1, wherein a width of the storage device corresponds to one of the group comprising 8, 16, 32, 64, and 128 bits.
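
Claims 1 to 3 describe what lexer literature usually calls longest-match scanning with backup. The Python sketch below is illustrative only and is not taken from the specification: the dictionary-based DFA layout, the `tokenize` function, and the sample patterns `ab` and `abcd` are all assumptions.

```python
def tokenize(dfa, accepting, start, data):
    """Longest-match scan that remembers the last accepting position.

    dfa       -- dict mapping (state, symbol) -> next state (hypothetical layout)
    accepting -- dict mapping accepting state -> lexeme name
    """
    tokens = []
    pos = 0
    while pos < len(data):
        state = start
        last_end = None    # stored location of the last symbol of the lexeme
        last_name = None   # label of the last accepting state (claim 2)
        i = pos
        # Keep evaluating additional symbols while a transition exists.
        while i < len(data) and (state, data[i]) in dfa:
            state = dfa[(state, data[i])]
            if state in accepting:
                last_end, last_name = i, accepting[state]
            i += 1
        if last_end is None:
            pos += 1       # no lexeme starts here; skip one symbol
            continue
        tokens.append((last_name, data[pos:last_end + 1]))
        pos = last_end + 1  # resume right after the stored location
    return tokens


# Hypothetical DFA recognizing the two patterns "ab" and "abcd".
DFA = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "d"): 4}
ACCEPTING = {2: "AB", 4: "ABCD"}

print(tokenize(DFA, ACCEPTING, 0, "abcab"))
```

On the input `abcab`, the scanner reads `c` while testing whether `ab` extends to `abcd`, fails on the following symbol, reports `ab`, and resumes at the symbol after the stored location.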

7. A method of recognizing a lexeme in a data file comprising a plurality of symbols, the method comprising:
- generating a regular expression query including a lexeme and a trailing context, wherein each of the lexeme and the trailing context includes one or more symbols;
  
  generating a deterministic finite automata (DFA) based on the regular expression query;
  
  executing the DFA on the data file, wherein the executing comprises identifying the lexeme in the data file after evaluating one or more symbols of the data file;
  
  storing in a storage device a trail head location indicating a position of the symbol immediately following the lexeme;
  
  evaluating one or more additional symbols of the data file;
  
  determining if the additional symbols match the trailing context; and
  
  if the additional symbols match the trailing context, reporting the identification of the lexeme.
  - 8. The method of claim 7, wherein if the additional symbols match the trailing context, evaluating additional symbols starting with the symbol indicated by the trail head location.
  - 9. The method of claim 7, wherein if the additional symbols do not match the trailing context, evaluating additional symbols starting with a location identified by a last accepting state.
  - 10. The method of claim 7, wherein if the additional symbols do not match the trailing context and there is not a stored last accepting state, evaluating additional symbols starting with the second symbol of the lexeme.
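
The trailing-context behavior of claims 7, 8, and 10 can be approximated in a few lines of Python. This is a hedged sketch, not the claimed hardware: it substitutes Python's `re` module for a compiled DFA, it takes the greedy lexeme match rather than trying every possible split, and the function name and sample patterns are assumptions.

```python
import re


def scan_with_trailing_context(lexeme_re, trail_re, data):
    """Report lexemes only when followed by the trailing context."""
    lexeme = re.compile(lexeme_re)
    trail = re.compile(trail_re)
    found = []
    pos = 0
    while pos < len(data):
        m = lexeme.match(data, pos)
        if m is None or m.end() == pos:
            pos += 1               # no lexeme starts at this symbol
            continue
        trail_head = m.end()       # position of the symbol right after the lexeme
        if trail.match(data, trail_head):
            found.append((pos, data[pos:trail_head]))
            pos = trail_head       # claim 8: continue from the trail head
        else:
            pos += 1               # claim 10: retry from the lexeme's second symbol
    return found


# Hypothetical query: an identifier, but only when followed by "(".
print(scan_with_trailing_context(r"[a-z]+", r"\(", "foo(bar) baz("))
```

The trailing context itself is never consumed as part of the reported lexeme, which is why scanning resumes at the trail head rather than after the context.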

11. A compiler configured to generate a deterministic finite automata (DFA) based at least partly upon one or more regular expression queries, the compiler comprising:
- means for determining one or more non-terminal states that occur logically after a non-terminal accepting state and before either of (1) a next non-terminal accepting state or (2) a terminal state; and
  
  means for associating a state transition instruction of the non-terminal accepting state with each of the determined one or more non-terminal states.
  - 12. The compiler of claim 11, wherein the state transition instruction includes any output instructions associated with the non-terminal accepting state.

13. A method of removing stall states from a state machine, the method comprising:
- (a) identifying a non-terminal accepting state by searching one or more states downstream from an initial state, wherein a lexeme is associated with the non-terminal accepting state;
  
  (b) identifying a non-terminal non-accepting state downstream from the identified non-terminal accepting state;
  
  (c) associating information identifying the lexeme with the non-terminal non-accepting state; and
  
  (d) repeating steps b and c until another non-terminal accepting state or a terminal state is reached.
  - 14. The method of claim 13, further comprising repeating steps a-d for each of a plurality of initial states.
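
Steps (a) to (d) of claim 13 amount to pushing the identity of the last accepted lexeme forward through the state graph. A minimal sketch follows, assuming a hypothetical adjacency-list representation in which a terminal state is one with no outgoing transitions; a production compiler would likely also need to split states that are reachable from more than one accepting state, which this sketch sidesteps by recording a set.

```python
def propagate_lexemes(transitions, accepting):
    """Attach each accepted lexeme to the non-accepting states downstream of it.

    transitions -- dict: state -> list of successor states (hypothetical layout)
    accepting   -- dict: accepting state -> lexeme name
    """
    inherited = {}
    for acc_state, lexeme in accepting.items():
        if not transitions.get(acc_state):
            continue                      # terminal accepting state: nothing downstream
        stack = list(transitions[acc_state])
        seen = set()
        while stack:
            s = stack.pop()
            # Stop at another accepting state or at a terminal state (step d).
            if s in seen or s in accepting or not transitions.get(s):
                continue
            seen.add(s)
            inherited.setdefault(s, set()).add(lexeme)   # step c
            stack.extend(transitions[s])                 # step b, repeated
    return inherited


# Hypothetical machine for the patterns "ab" (state 2) and "abcd" (state 4).
TRANSITIONS = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}
ACCEPTING = {2: "AB", 4: "ABCD"}
print(propagate_lexemes(TRANSITIONS, ACCEPTING))
```

With this annotation in place, a failure in state 3 can report `AB` directly instead of stalling to look up the last accepting state.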

15. A method of selecting one set of regular expression queries among a plurality of sets of regular expression queries, the method comprising:
- storing a plurality of regular expression queries in a computing device;
  
  receiving a data file comprising a plurality of symbols;
  
  identifying a start condition value in the received data file; and
  
  determining one set of regular expression queries that corresponds with the start condition.
  - 16. The method of claim 15, wherein each of the sets of regular expression queries comprises one or more regular expressions.
  - 17. The method of claim 15, wherein a jump table stores one or more start condition values each associated with an entry in a start state table.
  - 18. The method of claim 17, wherein each entry in the start state table is associated with a start location of each of the sets of regular expression queries.

19. A method of switching between sets of regular expression queries, the method comprising:
- storing a plurality of sets of regular expression queries in a computing device;
  
  receiving a data file comprising a plurality of symbols;
  
  identifying a start condition value in the received data file;
  
  determining a set of regular expression queries from the stored plurality of sets of regular expression queries that corresponds with the start condition;
  
  analyzing one or more symbols of the data file according to the determined set of regular expression queries;
  
  identifying, based on the one or more symbols of the data file, another set of regular expression queries; and
  
  executing the identified another set of regular expression queries.
  - 20. The method of claim 19, wherein each set of regular expression queries comprises one or more regular expressions.
  - 21. The method of claim 20, wherein two or more sets of regular expression queries each comprise a particular regular expression.
  - 22. The method of claim 19, wherein the act of identifying comprises identifying a lexeme in the data file that indicates the another set of regular expression queries.
  - 23. The method of claim 19, wherein the one or more symbols comprises a lexeme.
  - 24. The method of claim 23, wherein another start condition is associated with the lexeme.
  - 25. The method of claim 19, wherein:
    - if the one or more symbols matches a first predetermined pattern, the method further comprises executing a first regular expression query; and
      
      if the one or more symbols matches a second predetermined pattern, the method further comprises executing a second regular expression query.

26. A method of lexically analyzing a data file, the method comprising:
- (a) providing a first rule set corresponding to a first set of regular expressions;
  
  (b) identifying a first lexeme in the data file based at least partly upon the first rule set;
  
  (c) based on the identified first lexeme, identifying a second rule set corresponding to a second set of regular expressions; and
  
  (d) analyzing the data file according to the second rule set.
  - 27. The method of claim 26, wherein step d further comprises:
    - (e) identifying a second lexeme in the data file based at least partly upon the second rule set;
      
      (f) based on the identified second lexeme, identifying a third rule set corresponding to a third set of regular expressions; and
      
      (g) analyzing the data file according to the third rule set.
  - 28. The method of claim 27, wherein step g further comprises:
    - (h) identifying a third lexeme in the data file based at least partly upon the third rule set;
      
      (i) based on the identified third lexeme, identifying a fourth rule set corresponding to a fourth set of regular expressions; and
      
      (j) analyzing the data file according to the fourth rule set.

29. A method of lexically analyzing a data file, the method comprising:
- (a) providing a Nth rule set corresponding to a Nth set of regular expressions;
  
  (b) identifying a Nth lexeme in the data file according to the Nth rule set;
  
  (c) based on the identified first lexeme, identifying a (N+1)th rule set corresponding to a (N+1)th set of regular expressions;
  
  (d) setting N equal to N+1; and
  
  (e) repeating steps b-d.
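
The rule-set switching loop of claims 26 to 29 resembles start conditions in conventional lexer generators. The sketch below is an assumption-laden illustration rather than the claimed method: the rule-set names `TEXT` and `TAG`, the table layout, and the first-match (rather than longest-match) rule selection are all invented for the example, and Python's `re` module stands in for the compiled state machine.

```python
import re

# Hypothetical rule sets: each rule is (pattern, token name, rule set to switch to).
RULE_SETS = {
    "TEXT": [
        (re.compile(r"<"), "OPEN", "TAG"),
        (re.compile(r"[^<]+"), "CHARS", None),
    ],
    "TAG": [
        (re.compile(r">"), "CLOSE", "TEXT"),
        (re.compile(r"[a-z]+"), "NAME", None),
        (re.compile(r"\s+"), "SPACE", None),
    ],
}


def lex(data, start="TEXT"):
    """Tokenize data, letting each identified lexeme select the next rule set."""
    current = start
    pos = 0
    out = []
    while pos < len(data):
        for pattern, name, next_set in RULE_SETS[current]:
            m = pattern.match(data, pos)
            if m and m.end() > pos:
                out.append((current, name, m.group()))
                pos = m.end()
                if next_set is not None:
                    current = next_set   # the lexeme identifies the next rule set
                break
        else:
            raise ValueError(f"no rule in {current!r} matches at offset {pos}")
    return out


print(lex("hi<b>yo"))
```

Each output tuple records which rule set was active when the lexeme was found, which makes the hand-off from one rule set to the next visible.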

30. A system for lexically analyzing a data file, the system comprising:
- (a) means for providing a Nth rule set corresponding to a Nth set of regular expressions;
  
  (b) means for identifying a Nth lexeme in the data file according to the Nth rule set;
  
  (c) means for identifying a (N+1)th rule set corresponding to a (N+1)th set of regular expressions based on the identified first lexeme;
  
  (d) means for setting N equal to N+1;
  
  (e) means for repeating steps b-d.

31. A system for locating one or more tokens in a plurality of data files, each data file comprising a plurality of symbols, the system comprising:
- a storage device for storing at least a portion of one or more regular expression queries;
  
  a compiler configured to generate a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries;
  
  an execution engine configured to operate on the plurality of data files according to the DFA, wherein the execution engine is configured to process one symbol every M clock cycles; and
  
  a multiplexer coupled to the execution engine and configured to receive symbols from at least M of the plurality of data files, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.

32. A method of locating one or more tokens in M data files, each data file comprising a plurality of symbols, the method comprising:
- receiving one or more regular expression queries;
  
  generating a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries; and
  
  operating on the plurality of data files according to the DFA, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.

33. A system for locating one or more tokens in M data files, each data file comprising a plurality of symbols, the system comprising:
- means for receiving one or more regular expression queries;
  
  means for generating a deterministic finite automata (DFA) based at least partly upon the one or more regular expression queries; and
  
  means for operating on the plurality of data files according to the DFA, wherein the execution engine receives one symbol from each of the M data files at least every M clock cycles.
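
Claims 31 to 33 interleave M data files so that an engine needing M clock cycles per symbol still accepts a new symbol on every cycle. The following software simulation is only a sketch under stated assumptions: the round-robin selection, the per-stream state array, the fallback to the start state on a missing transition, and the sample parity DFA are hypothetical and are not drawn from the specification.

```python
def run_interleaved(dfa, start, streams):
    """Simulate one DFA engine fed round-robin from M streams.

    Returns the final state of each stream and a trace of
    (cycle, stream index, symbol) tuples.
    """
    m = len(streams)
    states = [start] * m     # one current state per data file
    cursors = [0] * m        # next unread position in each data file
    cycle = 0
    trace = []
    while any(cursors[k] < len(streams[k]) for k in range(m)):
        k = cycle % m        # multiplexer select: stream k owns this cycle
        if cursors[k] < len(streams[k]):
            sym = streams[k][cursors[k]]
            states[k] = dfa.get((states[k], sym), start)
            cursors[k] += 1
            trace.append((cycle, k, sym))
        cycle += 1
    return states, trace


# Hypothetical DFA tracking the parity of the number of "a" symbols seen.
PARITY = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
final_states, trace = run_interleaved(PARITY, 0, ["ab", "aa", "b"])
print(final_states)
print(trace)
```

Because stream k is only touched on cycles where `cycle % m == k`, each stream sees at least M cycles between its consecutive symbols, which is the latency budget the claims rely on.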

34. An apparatus for processing a single data file comprising a plurality of symbols, the apparatus comprising:
- a segmenter configured to divide the file into M regions;
  
  M storage locations each configured to buffer portions of one of the M regions;
  
  a core execution unit configured to execute a state machine, wherein movement from a current state to a next state in the state machine requires M clock cycles, the core execution unit comprising a storage device for storing information indicating one or more boundaries between the M regions, wherein the core execution unit reads a symbol from one of the M storage locations during each clock cycle.
  - 35. The apparatus of claim 34, wherein each of the M storage locations comprises a buffer.
  - 36. The apparatus of claim 34, wherein a buffer comprises each of the M storage locations.
  - 37. The apparatus of claim 34, wherein the data file comprises M substreams, wherein an ith substream comprises one or more symbols of an ith region and one or more symbols of an (i+1)st region.
  - 38. The apparatus of claim 37, wherein the core execution unit is further configured to re-process some symbols in the (i+1)st region in connection with analysis of the ith substream in order to identify a lexeme that crosses a boundary between the ith and the (i+1)st regions.
  - 39. The apparatus of claim 37, wherein the core execution unit is further configured to stop re-processing of symbols in the (i+1)st region in connection with the ith substream (1) after all symbols in the ith substream have been processed and (2) when an output result in re-processing the (i+1)st region in connection with the ith substream is the same as an output result produced by processing an (i+1)st substream.
  - 40. The apparatus of claim 34, wherein the data file comprises M substreams, wherein an ith substream comprises one or more symbols of an ith region and zero or more symbols of an (i+1)st region.
  - 41. The apparatus of claim 34, wherein the apparatus stores indications of each time the core execution unit (1) initiates an output and (2) determines that a start state is going to be entered.

42. A method of representing a state machine, the method comprising:
- (a) determining a number M of out transitions from a Nth state in the state machine;
  
  (b) generating an instruction corresponding to each of the M transitions from the Nth state, wherein each of the instructions includes an indication of a next state in the state machine;
  
  (c) repeating steps a and b for each of the states of the state machine; and
  
  (d) storing at least some of the instructions for each of the states of the state machine in a storage device, wherein the indication of the next state in the one or more instructions is usable to determine an address of the next state in the storage device.
  - 43. The method of claim 42, wherein for a particular state in the state machine, M-1 of the transitions are failure transitions and the M-1 failure transitions are combined in a single instruction for storage in the storage device.
  - 44. The method of claim 42, wherein the M transitions for the particular state are stored in the storage device.
  - 45. The method of claim 42, wherein an opcode associated with the M transitions from the particular state indicates that the single instruction represents transitions from M-1 states.
  - 46. The method of claim 42, wherein for a particular state in the state machine, M-2 of the transitions are failure transitions and the M-2 failure transitions are combined in a single instruction for storage in the storage device.
  - 47. The method of claim 46, wherein an opcode associated with the M transitions from the particular state indicates that the single instruction represents transitions from M-2 states.
  - 48. The method of claim 42, wherein for a particular state in the state machine, M-P of the transitions are failure transitions and the M-P failure transitions are combined in a single instruction for storage in the storage device.
  - 49. The method of claim 48, wherein an opcode associated with the M transitions from the particular state indicates that the single instruction represents transitions from M-P states.
  - 56. The method of claim 42, wherein at least one of the instructions is a virtual terminal instruction, wherein the virtual terminal instruction includes (a) information indicating an output that corresponds to the state associated with the virtual terminal instruction and (b) information usable to determine a next initial state, and wherein by executing the virtual terminal instruction, a transition is made directly to the next initial state and the output is produced in a single clock cycle.
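
The instruction-per-transition representation of claims 42 and 43 can be pictured as a flat program in which each state owns a contiguous block of instructions and all failure transitions of a state collapse into one instruction. The sketch below is illustrative only: the `MATCH` and `DEFAULT` opcode names, the tuple encoding, the `FAIL` sentinel, and the `address_of` lookup are assumptions, and input symbols are assumed to belong to the given alphabet.

```python
def assemble(machine, alphabet, fail="FAIL"):
    """Lay out one instruction per explicit transition, plus one shared failure
    instruction per state.

    machine -- dict: state -> {symbol: next_state} (hypothetical layout)
    Returns (program, address_of), where address_of maps a state to the
    address of its first instruction.
    """
    program = []
    address_of = {}
    for state, row in machine.items():
        address_of[state] = len(program)
        for sym in alphabet:
            if sym in row:
                program.append(("MATCH", sym, row[sym]))
        if len(row) < len(alphabet):
            # All remaining (failure) transitions share this single instruction.
            program.append(("DEFAULT", None, fail))
    return program, address_of


def step(program, address_of, state, sym):
    """Walk the state's instruction block until one applies to sym."""
    addr = address_of[state]
    while True:
        op, s, nxt = program[addr]
        if op == "DEFAULT" or s == sym:
            return nxt
        addr += 1


# Hypothetical two-state machine over the alphabet "abc".
MACHINE = {0: {"a": 1}, 1: {"b": 0, "c": 1}}
program, address_of = assemble(MACHINE, "abc")
print(len(program), address_of)
```

State 0 has three outgoing transitions over the alphabet, two of which fail, so it is stored as two instructions instead of three; the whole machine takes five instructions rather than six.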

50. A method of moving between a plurality of states of a state machine, wherein a plurality of instructions indicate transitions between states of the state machine, the method comprising:
- selecting an instruction corresponding to a transition from a first state, wherein the act of selecting is based, at least partly, on one or more current symbol classes;
  
  setting an offset according to one or more of the current symbol classes and one or more fields of the selected instruction;
  
  determining an address of a next state by adding the offset to an address of the selected instruction.
  - 51. The method of claim 50, wherein the offset is set equal to the current symbol class.
  - 52. The method of claim 50, wherein the offset is set according to a correspondence between one or more elements of the selected instruction and the current symbol classes.
  - 53. The method of claim 50, wherein the offset is set to the value obtained by subtracting an element of the selected instruction from one of the current symbol classes.
  - 54. The method of claim 50, wherein the offset is set to the result of an arithmetic operation performed on one or more of the current symbol classes and one or more elements of the selected instruction.
  - 55. The method of claim 50, wherein the offset is set according to one or more of the current symbol classes.

57. A state machine comprising:
- a plurality of instructions, each instruction representing a transition from one state to another state in a state machine; and
  
  a virtual terminal instruction including (a) information indicating an output that corresponds to a state associated with the virtual terminal instruction and (b) information usable to determine a next state, wherein by executing the virtual terminal instruction, the state machine transitions from the state associated with the virtual terminal instruction to the determined next state in a single clock cycle.
  - 58. The state machine of claim 57, wherein, during the single clock cycle the output is produced.

Current Assignee: LSI Corporation (Broadcom, Inc.)
Original Assignee: LSI Corporation (Broadcom, Inc.)
Inventors: Ruehle, Michael D.; McMillen, Robert J.
Application Number: US10/851,482
Publication Number: US 20050273450A1
US Class Current: 1/1
CPC Class Codes: G06V 10/94 Hardware or software archit...
