EXTRACTING INFORMATION FROM UNSTRUCTURED DATA AND MAPPING THE INFORMATION TO A STRUCTURED SCHEMA USING THE NAÏVE BAYESIAN PROBABILITY MODEL
First Claim
1. A computer-implemented method for classifying a token of an unstructured event according to one of a plurality of fields of a schema, comprising:
- using a Naïve Bayesian classifier, wherein features of the token include one or more of Value, Type, Index, and Previous Word, and wherein probabilities of feature occurrences are based on statistics collected regarding other events whose tokens have been deterministically mapped to the fields of the schema.
Abstract
An “unstructured event parser” analyzes an event that is in unstructured form and generates an event that is in structured form. A mapping phase determines, for a given event token, possible fields of the structured event schema to which the token could be mapped and the probabilities that the token should be mapped to those fields. Particular tokens are then mapped to particular fields of the structured event schema. By using the Naïve Bayesian probability model, a “probabilistic mapper” determines, for a particular token and a particular field, the probability that that token maps to that field. The probabilistic mapper can also be used in a “regular expression creator” that generates a regex that matches an unstructured event and a “parameter file creator” that helps a user create a parameter file for use with a parameterized normalized event generator to generate a normalized event based on an unstructured event.
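The probabilistic mapper described in the abstract can be illustrated with a minimal Naïve Bayes sketch. This is not the patented implementation: the class name `ProbabilisticMapper`, the Laplace smoothing constant `alpha`, and the field names `sourceAddress` and `destinationAddress` are assumptions made for illustration. Only the four features (Value, Type, Index, Previous Word) and the idea of collecting statistics from deterministically mapped tokens come from the text.

```python
from collections import Counter, defaultdict

# The four token features named in claim 1.
FEATURES = ("value", "type", "index", "previous_word")


class ProbabilisticMapper:
    """Naive Bayes over token features; schema fields are the classes."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # smoothing constant (assumption)
        self.field_counts = Counter()               # tokens observed per field
        self.feature_counts = defaultdict(Counter)  # (field, feature) -> value counts
        self.vocab = defaultdict(set)               # feature -> distinct values seen

    def observe(self, features, field):
        """Record one token that a deterministic parser already mapped to `field`."""
        self.field_counts[field] += 1
        for name in FEATURES:
            self.feature_counts[(field, name)][features[name]] += 1
            self.vocab[name].add(features[name])

    def probabilities(self, features):
        """Return {field: P(field | features)}, normalized over known fields."""
        total = sum(self.field_counts.values())
        scores = {}
        for field, n in self.field_counts.items():
            score = n / total  # prior P(field)
            for name in FEATURES:
                count = self.feature_counts[(field, name)][features[name]]
                # The extra +1 reserves probability mass for unseen values.
                denom = n + self.alpha * (len(self.vocab[name]) + 1)
                score *= (count + self.alpha) / denom
            scores[field] = score
        norm = sum(scores.values())
        return {field: score / norm for field, score in scores.items()}


# Hypothetical training data: two tokens already mapped by a deterministic parser.
mapper = ProbabilisticMapper()
mapper.observe(
    {"value": "10.0.0.1", "type": "IPAddress", "index": 3, "previous_word": "from"},
    "sourceAddress",
)
mapper.observe(
    {"value": "10.0.0.2", "type": "IPAddress", "index": 5, "previous_word": "to"},
    "destinationAddress",
)

# A new token whose Index and Previous Word resemble the first example.
probs = mapper.probabilities(
    {"value": "10.0.0.9", "type": "IPAddress", "index": 3, "previous_word": "from"}
)
```

In this toy example the unseen IP address scores higher for `sourceAddress`, because its Index and Previous Word match the token previously mapped to that field.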
7 Claims
-
1. A computer-implemented method for classifying a token of an unstructured event according to one of a plurality of fields of a schema, comprising:
using a Naïve Bayesian classifier, wherein features of the token include one or more of Value, Type, Index, and Previous Word, and wherein probabilities of feature occurrences are based on statistics collected regarding other events whose tokens have been deterministically mapped to the fields of the schema.
-
2. A computer-implemented method for generating a normalized event that adheres to a normalized schema, comprising:
-
receiving an unstructured event;
dividing the unstructured event into a plurality of tokens;
for each token, determining a value for each feature within a set of features, wherein the set of features includes a Value feature and a Type feature, and wherein a value of a token's Type feature is determined based on a value of the token's Value feature; and
for each token:
  for each field of the normalized schema, determining a probability that the token maps to the field;
  responsive to the token's Type feature having a value other than Unknown or Word:
    determining the field of the normalized schema with the highest probability;
    mapping the token to the determined field;
    determining a value of the determined field based on the value of the token's Value feature; and
    setting the determined field of a normalized event to the determined value; and
  responsive to the token's Type feature having a value of Word and responsive to the highest probability exceeding a threshold:
    determining the field of the normalized schema with the highest probability;
    mapping the token to the determined field;
    determining a value of the determined field based on the value of the token's Value feature; and
    setting the determined field of the normalized event to the determined value.
Dependent claims: 3, 4, 5.
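The decision rule recited in claim 2 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: whitespace tokenization, the regular expressions in `TYPE_PATTERNS`, the default `threshold` of 0.8, the stub `demo_probabilities`, and the field names are all hypothetical. The claim only states that Type is derived from Value, that non-Word, non-Unknown tokens map to their most probable field, and that Word tokens map only when the highest probability exceeds a threshold.

```python
import re

# Hypothetical Type rules; the claim only says Type is determined from Value.
TYPE_PATTERNS = [
    ("IPAddress", re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
    ("Number", re.compile(r"\d+")),
    ("Word", re.compile(r"[A-Za-z]+")),
]


def token_type(value):
    """Derive the Type feature from the Value feature."""
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "Unknown"


def map_tokens(event, field_probabilities, threshold=0.8):
    """Build a normalized event (a dict) from an unstructured event string.

    `field_probabilities(value, ttype)` must return {field: probability}.
    """
    normalized = {}
    for value in event.split():  # whitespace tokenization is an assumption
        ttype = token_type(value)
        probs = field_probabilities(value, ttype)  # computed for every token
        if ttype == "Unknown" or not probs:
            continue  # Unknown tokens are never mapped
        best = max(probs, key=probs.get)
        if ttype == "Word" and probs[best] <= threshold:
            continue  # Word tokens need the top probability to exceed the threshold
        # The field value is copied straight from the token Value here (assumption).
        normalized[best] = value
    return normalized


def demo_probabilities(value, ttype):
    """Stand-in for a trained probabilistic mapper, with made-up numbers."""
    if ttype == "IPAddress":
        return {"sourceAddress": 0.7, "destinationAddress": 0.3}
    if value == "denied":
        return {"action": 0.9, "message": 0.1}
    return {"message": 0.5, "action": 0.5}


result = map_tokens("denied connection from 10.0.0.1", demo_probabilities)
```

With the stub probabilities above, the IP address is mapped even though its top probability (0.7) is below the threshold, while the Word tokens "connection" and "from" are left unmapped because their top probability does not exceed it.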
-
-
6. A computer program product for generating a normalized event that adheres to a normalized schema, wherein the computer program product is stored on a computer-readable medium that includes instructions that, when loaded into memory, cause a processor to perform a method, the method comprising:
-
receiving an unstructured event;
dividing the unstructured event into a plurality of tokens;
for each token, determining a value for each feature within a set of features, wherein the set of features includes a Value feature and a Type feature, and wherein a value of a token's Type feature is determined based on a value of the token's Value feature; and
for each token:
  for each field of the normalized schema, determining a probability that the token maps to the field;
  responsive to the token's Type feature having a value other than Unknown or Word:
    determining the field of the normalized schema with the highest probability;
    mapping the token to the determined field;
    determining a value of the determined field based on the value of the token's Value feature; and
    setting the determined field of a normalized event to the determined value; and
  responsive to the token's Type feature having a value of Word and responsive to the highest probability exceeding a threshold:
    determining the field of the normalized schema with the highest probability;
    mapping the token to the determined field;
    determining a value of the determined field based on the value of the token's Value feature; and
    setting the determined field of the normalized event to the determined value.
-
-
7. A system for generating a normalized event that adheres to a normalized schema, the system comprising:
-
a computer-readable medium that includes instructions that, when loaded into memory, cause a processor to perform a method, the method comprising:
  receiving an unstructured event;
  dividing the unstructured event into a plurality of tokens;
  for each token, determining a value for each feature within a set of features, wherein the set of features includes a Value feature and a Type feature, and wherein a value of a token's Type feature is determined based on a value of the token's Value feature; and
  for each token:
    for each field of the normalized schema, determining a probability that the token maps to the field;
    responsive to the token's Type feature having a value other than Unknown or Word:
      determining the field of the normalized schema with the highest probability;
      mapping the token to the determined field;
      determining a value of the determined field based on the value of the token's Value feature; and
      setting the determined field of a normalized event to the determined value; and
    responsive to the token's Type feature having a value of Word and responsive to the highest probability exceeding a threshold:
      determining the field of the normalized schema with the highest probability;
      mapping the token to the determined field;
      determining a value of the determined field based on the value of the token's Value feature; and
      setting the determined field of the normalized event to the determined value; and
a processor for performing the method.
-
Specification