Real-time categorization of log events
First Claim
1. A method for categorizing a real-time log event, the method comprising:
computing a Term Frequency-Inverse Document Frequency (TF-IDF) matrix of a log corpus based on a number of pre-existing log events in the log corpus and a number of words in the log corpus;
computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of the log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message;
generating a cluster model based on the TF-IDF matrix, wherein the cluster model is indicative of a number of clusters corresponding to the log corpus, and wherein a cluster is indicative of a log category;
determining a centroid matrix of the log corpus based on the number of clusters in the cluster model and the number of words in the log corpus;
calculating a cluster radius and a silhouette width of each cluster, wherein the cluster radius of a cluster is calculated based on a distance between a cluster centroid of the cluster and a farthest point in the cluster; and wherein the silhouette width of the cluster is indicative of compactness of the cluster;
determining a silhouette threshold for each cluster based on the corresponding cluster radius and the corresponding silhouette width;
calculating a distance between the TF-IDF vector and the cluster centroid of each cluster in the log corpus;
identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and
categorizing the real-time log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
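The two TF-IDF steps of the claim can be illustrated with a minimal, standard-library sketch. The claim does not fix a tokenizer or a TF-IDF weighting, so the whitespace tokenizer, the smoothed IDF variant, and the names `tokenize`, `build_tfidf`, and `vectorize_event` are all assumptions made for illustration. The matrix has one row per pre-existing log event and one column per corpus word; the vector for a new event reuses the pre-calculated IDF values and appends one extra dimension per new word.

```python
import math
from collections import Counter


def tokenize(event):
    """Lowercase whitespace tokenizer (assumed; real logs need more cleaning)."""
    return event.lower().split()


def build_tfidf(corpus):
    """Return (vocab, idf, matrix); matrix is n_events x n_words."""
    docs = [Counter(tokenize(e)) for e in corpus]
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1
    idf = {
        w: math.log((1 + n_docs) / (1 + sum(1 for d in docs if w in d))) + 1.0
        for w in vocab
    }
    # Term frequency is the relative frequency of the word within the event.
    matrix = [[d[w] / sum(d.values()) * idf[w] for w in vocab] for d in docs]
    return vocab, idf, matrix


def vectorize_event(event, vocab, idf, n_docs):
    """TF-IDF vector of a new event from the pre-calculated IDF values."""
    counts = Counter(tokenize(event))
    total = sum(counts.values())
    new_words = sorted(w for w in counts if w not in idf)
    # A word unseen in the corpus has document frequency 0.
    new_idf = math.log(1 + n_docs) + 1.0
    vec = [counts[w] / total * idf[w] for w in vocab]
    vec += [counts[w] / total * new_idf for w in new_words]
    return vec, new_words


# Hypothetical toy corpus of pre-existing log events.
corpus = [
    "disk full on node1",
    "disk full on node2",
    "connection timeout to db",
]
vocab, idf, matrix = build_tfidf(corpus)
vec, new_words = vectorize_event("disk failure on node1", vocab, idf, len(corpus))
print(len(matrix), len(vocab), new_words, len(vec))
```

In this sketch the new event's vector is longer than a matrix row whenever it contains unseen words, so any later distance computation has to pad the centroids with zeros for those extra dimensions.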
Abstract
Embodiments for categorizing a real-time log event are described. In one example, a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the log event is computed based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the log event, where the log corpus comprises one or more pre-existing log events, and where the log event is indicative of an error message. Further, a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus is calculated. Thereafter, a cluster having a closest cluster centroid is identified from amongst the clusters based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, where the closest cluster centroid is the cluster centroid closest to the TF-IDF vector. Subsequently, the log event is categorized into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
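The real-time part of the abstract (distance to every centroid, nearest centroid, threshold comparison) might look like the following sketch. Several points are assumptions rather than statements from the patent: Euclidean distance is used as the distance measure, centroids are zero-padded for dimensions contributed by new words, and an event that falls outside the threshold is returned as uncategorized (`None`). The function names and the toy centroids, thresholds, and labels are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def categorize(vector, centroids, thresholds, labels):
    """Return (label, distance); label is None if no category fits."""
    # Pad each centroid with zeros for dimensions added by new words (assumed).
    distances = [
        euclidean(vector, c + [0.0] * (len(vector) - len(c))) for c in centroids
    ]
    # Index of the closest cluster centroid.
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    # Compare the distance with that cluster's pre-determined threshold.
    if distances[nearest] <= thresholds[nearest]:
        return labels[nearest], distances[nearest]
    return None, distances[nearest]


# Hypothetical two-cluster model in a two-word vocabulary.
centroids = [[0.0, 0.0], [10.0, 0.0]]
thresholds = [2.0, 3.0]
labels = ["disk", "network"]

inside = categorize([1.0, 0.0], centroids, thresholds, labels)
outside = categorize([5.0, 0.0], centroids, thresholds, labels)
with_new_word = categorize([9.0, 0.0, 1.0], centroids, thresholds, labels)
print(inside, outside, with_new_word)
```

Only the threshold of the nearest cluster is consulted, which mirrors the wording "silhouette threshold corresponding to the cluster with the closest cluster centroid".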
16 Claims
1. A method for categorizing a real-time log event, the method comprising:
computing a Term Frequency-Inverse Document Frequency (TF-IDF) matrix of a log corpus based on a number of pre-existing log events in the log corpus and a number of words in the log corpus;
computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of the log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message;
generating a cluster model based on the TF-IDF matrix, wherein the cluster model is indicative of a number of clusters corresponding to the log corpus, and wherein a cluster is indicative of a log category;
determining a centroid matrix of the log corpus based on the number of clusters in the cluster model and the number of words in the log corpus;
calculating a cluster radius and a silhouette width of each cluster, wherein the cluster radius of a cluster is calculated based on a distance between a cluster centroid of the cluster and a farthest point in the cluster; and wherein the silhouette width of the cluster is indicative of compactness of the cluster;
determining a silhouette threshold for each cluster based on the corresponding cluster radius and the corresponding silhouette width;
calculating a distance between the TF-IDF vector and the cluster centroid of each cluster in the log corpus;
identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and
categorizing the real-time log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
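The per-cluster quantities named in the claim can be sketched as follows. The cluster radius follows the claim directly (distance from the centroid to the farthest member). For the silhouette width, the sketch assumes the standard silhouette coefficient averaged over the cluster's members. The claim does not state how radius and silhouette width are combined into a threshold, so `silhouette_threshold` below is a purely hypothetical placeholder rule, not the patented formula.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_radius(points, centroid):
    """Distance from the centroid to the farthest point in the cluster."""
    return max(euclidean(p, centroid) for p in points)


def silhouette_width(clusters, k):
    """Mean silhouette coefficient of clusters[k]; needs at least 2 clusters."""
    own = clusters[k]
    scores = []
    for i, p in enumerate(own):
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for j, q in enumerate(own) if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for m, other in enumerate(clusters)
            if m != k
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def silhouette_threshold(radius, width):
    """Hypothetical rule: widen the radius for well-separated clusters."""
    return radius * (1.0 + max(width, 0.0))


# Toy clusters in a two-dimensional space (illustrative numbers only).
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 0.0], [10.0, 2.0]],
]
radius = cluster_radius(clusters[0], [0.0, 1.0])
width = silhouette_width(clusters, 0)
threshold = silhouette_threshold(radius, width)
print(radius, round(width, 3), round(threshold, 3))
```

A silhouette width near 1 indicates a compact, well-separated cluster, while values near 0 or below indicate overlap with a neighbouring cluster.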
9. A log categorization system for categorizing a real-time log event, the log categorization system comprising:
a processor;
a clustering module coupled to the processor to,
compute a Term Frequency-Inverse Document Frequency (TF-IDF) matrix of a log corpus based on a number of pre-existing log events in the log corpus and a number of words in the log corpus;
compute a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message;
generate a cluster model based on the TF-IDF matrix, wherein the cluster model is indicative of the number of clusters corresponding to the log corpus, and wherein a cluster is indicative of a log category; and
determine the centroid matrix of the log corpus based on the number of clusters in the cluster model and the number of words in the log corpus;
a log categorization module coupled to the processor to,
calculate a cluster radius and a silhouette width of each cluster, wherein the cluster radius of a cluster is calculated based on a distance between a cluster centroid of the cluster and a farthest point in the cluster; and wherein the silhouette width of the cluster is indicative of compactness of the cluster;
determine a silhouette threshold for each cluster based on the corresponding cluster radius and the corresponding silhouette width;
calculate a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus;
identify, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and
categorize the real-time log event into a log category based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
Dependent claims: 10, 11, 12, 13, 14, 15
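The clustering module's output, a cluster model plus a centroid matrix with one row per cluster and one column per corpus word, can be illustrated with a small k-means sketch. The claim does not name a clustering algorithm or say how the number of clusters is chosen, so Lloyd's k-means, the fixed `k`, the naive initialization from the first `k` rows, and the function names are all assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(matrix, k, iterations=20):
    """Lloyd's algorithm; returns (assignments, centroid_matrix)."""
    # Naive deterministic initialization: the first k rows (assumed).
    centroids = [list(row) for row in matrix[:k]]
    assignments = [0] * len(matrix)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: euclidean(row, centroids[c]))
            for row in matrix
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(matrix, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Stand-in for a TF-IDF matrix: 4 log events, 2 words (toy numbers).
tfidf_matrix = [
    [0.0, 0.0],
    [10.0, 0.0],
    [0.0, 2.0],
    [10.0, 2.0],
]
assignments, centroid_matrix = kmeans(tfidf_matrix, k=2)
print(assignments)      # [0, 1, 0, 1]
print(centroid_matrix)  # [[0.0, 1.0], [10.0, 1.0]]
```

The centroid matrix here has shape (number of clusters) x (number of words), matching the two quantities the claim says it is based on.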
16. A non-transitory computer-readable medium having embodied thereon a computer program for executing a method comprising:
computing a Term Frequency-Inverse Document Frequency (TF-IDF) matrix of a log corpus based on a number of pre-existing log events in the log corpus and a number of words in the log corpus;
computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for a log event based on a pre-calculated TF-IDF matrix of the log corpus and a number of new words in the log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the log event is indicative of an error message;
generating a cluster model based on the TF-IDF matrix, wherein the cluster model is indicative of a number of clusters corresponding to the log corpus, and wherein a cluster is indicative of a log category;
determining a centroid matrix of the log corpus based on the number of clusters in the cluster model and the number of words in the log corpus;
calculating a cluster radius and a silhouette width of each cluster, wherein the cluster radius of a cluster is calculated based on a distance between a cluster centroid of the cluster and a farthest point in the cluster; and wherein the silhouette width of the cluster is indicative of compactness of the cluster;
determining a silhouette threshold for each cluster based on the corresponding cluster radius and the corresponding silhouette width;
calculating a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus;
identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and
categorizing the log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
Specification