Field content based pattern generation for heterogeneous logs
First Claim
1. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein the processor generates the seed patterns by running program code to:
identify semantics of the tokens by assigning one of a plurality of semantic datatypes to the tokens based on Regular Expression rules;
generate seed-pattern signatures, wherein a seed-pattern signature is generated for each of the heterogeneous input logs by position-wise concatenating the semantic datatypes of the tokens therein with spaces; and
identify unique seed-pattern signatures from the seed-pattern signatures using an index, wherein each index entry includes the seed-pattern signature as an index key and associated metadata obtained as a counter value as an index value;
wherein the processor generates the seed patterns by running code to:
search the index for a given seed-pattern signature;
discard the given seed-pattern signature responsive to a matching one being found in the index and increasing the counter value; and
add the given seed-pattern signature to a database of seed-pattern signatures responsive to an absence of the matching one in the index.
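The signature and index steps recited in claim 1 can be illustrated with a minimal sketch. The datatype names, the Regular Expression rules, the whitespace tokenizer, and the initial counter value of 1 are assumptions made for illustration; the claim does not specify them.

```python
import re

# Hypothetical datatype rules, checked in order; the first full match wins.
DATATYPE_RULES = [
    ("IP", re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("NUMBER", re.compile(r"[+-]?\d+(?:\.\d+)?")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def datatype_of(token):
    """Assign one semantic datatype to a token using the regex rules."""
    for name, rule in DATATYPE_RULES:
        if rule.fullmatch(token):
            return name
    return "NOTSPACE"  # fallback datatype for anything unmatched


def signature_of(log_line):
    """Concatenate the datatypes of the tokens, position-wise, with spaces."""
    return " ".join(datatype_of(token) for token in log_line.split())


def build_index(logs):
    """Map each unique signature (key) to a counter (value)."""
    index = {}
    for line in logs:
        signature = signature_of(line)
        if signature in index:
            index[signature] += 1  # match found: discard, bump the counter
        else:
            index[signature] = 1   # no match: add the new signature
    return index


logs = [
    "Connect from 10.0.0.1 port 22",
    "Connect from 10.0.0.7 port 8080",
    "Disk usage 91.5 percent",
]
index = build_index(logs)
```

Here the first two logs share the signature `WORD WORD IP WORD NUMBER`, so the index holds two entries rather than three.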
Abstract
A system and method are provided for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields. The system includes a memory. The system further includes a processor in communication with the memory. The processor runs program code to preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens. The processor runs program code to generate seed patterns from the preprocessed logs. The processor runs program code to generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set.
8 Claims
2. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein the processor generates the seed patterns by running program code to:
identify semantics of the tokens by assigning one of a plurality of semantic datatypes to the tokens based on Regular Expression rules;
generate seed-pattern signatures, wherein a seed-pattern signature is generated for each of the heterogeneous input logs by position-wise concatenating the semantic datatypes of the tokens therein with spaces; and
identify unique seed-pattern signatures from the seed-pattern signatures using an index, wherein each index entry includes the seed-pattern signature as an index key and associated metadata obtained as a counter value as an index value;
wherein the processor generates the seed patterns by running code to generate a single seed-pattern for every seed-pattern signature in the index.
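Generating one seed pattern per index entry, as claim 2 recites, might look like the following sketch. It assumes the seed patterns use the same `%{SYNTAX:SEMANTIC}` form that claim 8 describes for final patterns, and the `P<pattern>F<field>` naming scheme is hypothetical.

```python
def seed_pattern(signature, pattern_id):
    """Turn one signature into one seed pattern with generic field names."""
    fields = []
    for position, datatype in enumerate(signature.split(" "), start=1):
        # Hypothetical field id: pattern number plus field position.
        field_name = "P" + str(pattern_id) + "F" + str(position)
        fields.append("%{" + datatype + ":" + field_name + "}")
    return " ".join(fields)


def seed_patterns(index):
    """Emit exactly one seed pattern for every signature in the index."""
    return [seed_pattern(sig, i) for i, sig in enumerate(index, start=1)]


# Example index: signature -> counter value.
index = {"WORD WORD IP WORD NUMBER": 2, "WORD NUMBER": 1}
patterns = seed_patterns(index)
```

The counter values play no role in this step; only the index keys are used.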
3. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein the processor generates the final patterns by running the program code to parse the preprocessed logs using the seed patterns to obtain parsed logs.
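Parsing a preprocessed log against a seed pattern, as in claim 3, can be sketched as below. The representation of a seed pattern as a list of (datatype, field name) pairs and the regex rules are assumptions for illustration only.

```python
import re

# Hypothetical Regular Expression rule for each semantic datatype.
RULES = {
    "IP": re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"),
    "NUMBER": re.compile(r"[+-]?\d+(?:\.\d+)?"),
    "WORD": re.compile(r"[A-Za-z]+"),
}


def parse(log_line, seed_pattern):
    """Return {field: value} if the log matches the seed pattern, else None.

    seed_pattern is a list of (datatype, field_name) pairs.
    """
    tokens = log_line.split()
    if len(tokens) != len(seed_pattern):
        return None
    parsed = {}
    for token, (datatype, field) in zip(tokens, seed_pattern):
        if not RULES[datatype].fullmatch(token):
            return None  # token does not fit the datatype at this position
        parsed[field] = token
    return parsed


pattern = [("WORD", "F1"), ("WORD", "F2"), ("IP", "F3")]
parsed = parse("Connect from 10.0.0.1", pattern)
unparsed = parse("Connect from host", pattern)
```

The parsed logs expose the concrete values seen in each field, which is the information a later specialization step would need.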
5. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein the processor generates the final patterns from the seed patterns by running program code to select from among a plurality of pattern specializing settings selected from the group consisting of a low setting, a medium setting, and a high setting.
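Claim 5 names three specializing settings but does not say how they differ. The sketch below is therefore purely illustrative: it assumes each setting is a threshold on the number of distinct values a field may have before it is left generic, and that specializing a field means replacing it with its literal value. The thresholds 0, 1, and 3 are invented for the example.

```python
from enum import Enum


class Setting(Enum):
    # Hypothetical thresholds: maximum distinct values for a field to be specialized.
    LOW = 0
    MEDIUM = 1
    HIGH = 3


def specialize(seed_pattern, parsed_logs, setting):
    """Replace low-cardinality fields by their literal values.

    seed_pattern is a list of (datatype, field_name) pairs and
    parsed_logs is a list of {field_name: value} dictionaries.
    """
    distinct = {field: {log[field] for log in parsed_logs}
                for _, field in seed_pattern}
    selected = {field for field, values in distinct.items()
                if len(values) <= setting.value}
    finals = []
    for log in parsed_logs:
        parts = [log[field] if field in selected
                 else "%{" + datatype + ":" + field + "}"
                 for datatype, field in seed_pattern]
        candidate = " ".join(parts)
        if candidate not in finals:  # keep each final pattern once
            finals.append(candidate)
    return finals


seed = [("WORD", "F1"), ("WORD", "F2"), ("IP", "F3")]
parsed_logs = [
    {"F1": "Connect", "F2": "from", "F3": "10.0.0.1"},
    {"F1": "Disconnect", "F2": "from", "F3": "10.0.0.7"},
]
low = specialize(seed, parsed_logs, Setting.LOW)
medium = specialize(seed, parsed_logs, Setting.MEDIUM)
high = specialize(seed, parsed_logs, Setting.HIGH)
```

Under these assumptions the low setting leaves the seed pattern unchanged, the medium setting fixes only the constant word `from`, and the high setting splits the seed pattern into one fully literal pattern per log.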
7. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein multiple ones of the tokens of a given field are concatenated using a specialized connector configured such that the concatenated multiple ones of the tokens are processed as a single token by a pattern generator used by the processor to generate the seed patterns.
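The connector idea in claim 7 can be sketched as follows. The connector string `__` and the example of a two-token timestamp field are assumptions; the claim only requires that the joined tokens be treated as a single token downstream.

```python
CONNECTOR = "__"  # hypothetical connector; assumed never to occur in the logs


def merge_field(tokens, start, length):
    """Join `length` tokens beginning at `start` into one token."""
    merged = CONNECTOR.join(tokens[start:start + length])
    return tokens[:start] + [merged] + tokens[start + length:]


def split_field(token):
    """Recover the original tokens of a merged field."""
    return token.split(CONNECTOR)


tokens = "2016/02/23 09:00:31 login failed".split()
merged = merge_field(tokens, 0, 2)
restored = split_field(merged[0])
```

Because the merged field contains no whitespace, a whitespace-based pattern generator sees it as one token at one position.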
8. A system for pattern discovery in input heterogeneous logs having unstructured text content and one or more fields, the system comprising:
a memory; and
a processor in communication with the memory, wherein the processor runs program code to:
preprocess the input heterogeneous logs to obtain pre-processed logs by splitting the input heterogeneous logs into tokens;
generate seed patterns from the preprocessed logs; and
generate final patterns by specializing a selected set of fields in each of the seed patterns to generate a final pattern set;
wherein the final patterns are generated as GROK patterns having a form that includes a syntax component and a semantic component, the syntax component denoting a pattern name to Regular Expressions text matching methodology and the semantic component denoting an identifier for a Regular Expressions text being matched.
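A GROK pattern of the form `%{SYNTAX:SEMANTIC}` can be expanded into an ordinary regular expression with named groups, as the sketch below shows. The small pattern library is a simplified, hypothetical stand-in and not the definitions shipped with any GROK implementation.

```python
import re

# Hypothetical, simplified library: syntax name -> Regular Expression.
GROK_LIBRARY = {
    "WORD": r"[A-Za-z]+",
    "NUMBER": r"[+-]?\d+(?:\.\d+)?",
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
}

GROK_FIELD = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(grok):
    """Compile a GROK pattern into a regex with one named group per field."""
    parts = []
    position = 0
    for field in GROK_FIELD.finditer(grok):
        # Literal text between fields must match exactly.
        parts.append(re.escape(grok[position:field.start()]))
        syntax, semantic = field.group(1), field.group(2)
        parts.append("(?P<" + semantic + ">" + GROK_LIBRARY[syntax] + ")")
        position = field.end()
    parts.append(re.escape(grok[position:]))
    return re.compile("".join(parts))


regex = grok_to_regex("%{WORD:action} from %{IP:client} port %{NUMBER:port}")
match = regex.fullmatch("Connect from 10.0.0.1 port 22")
fields = match.groupdict()
```

The syntax component selects which expression does the matching, and the semantic component becomes the name under which the matched text is returned.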
Specification