REAL-TIME CATEGORIZATION OF LOG EVENTS
First Claim
1. A method for categorizing a real-time log event, the method comprising:
- computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message;
calculating a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus;
identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and
categorizing the real-time log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
1 Assignment
0 Petitions
Accused Products
Abstract
Embodiments for categorizing a real-time log event are described. In one example, a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the log event is computed based on pre-calculated TF-IDF matrix of log corpus and number of new words in log event, where log corpus comprises one or more pre-existing log events, and where the log event is indicative of error message. Further, distance between TF-IDF vector and cluster centroid of each cluster in the log corpus is calculated. Thereafter, cluster having closest cluster centroid is identified from amongst the clusters based on distance between TF-IDF vector and cluster centroid of each of the clusters, where closest cluster centroid is cluster centroid closest to TF-IDF vector. Subsequently, log event is categorized into one or more log categories based on comparison of distance between TF-IDF vector and closest cluster centroid pre-determined silhouette threshold corresponding to cluster with closest cluster centroid.
-
Citations
18 Claims
-
1. A method for categorizing a real-time log event, the method comprising:
-
computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message; calculating a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus; identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and categorizing the real-time log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A log categorization system (102) for categorizing a real-time log event, the log categorization system (102) comprising:
-
a processor (104); a clustering module (116) coupled to the processor (104) to, compute a Term Frequency-Inverse Document Frequency (TF-IDF) vector for the real-time log event based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the real-time log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the real-time log event is indicative of an error message; a log categorization module (120) coupled to the processor (104) to, calculate a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus; identify, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and categorize the real-time log event into a log category based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17)
-
-
18. A non-transitory computer-readable medium having embodied thereon a computer program for executing a method comprising:
-
computing a Term Frequency-Inverse Document Frequency (TF-IDF) vector for a log event based on a pre-calculated TF-IDF matrix of a log corpus and a number of new words in the log event, wherein the log corpus comprises one or more pre-existing log events, and wherein the log event is indicative of an error message; calculating a distance between the TF-IDF vector and a cluster centroid of each cluster in the log corpus; identifying, from amongst the clusters, a cluster having a closest cluster centroid based on the distance between the TF-IDF vector and the cluster centroid of each of the clusters, wherein the closest cluster centroid is a cluster centroid closest to the TF-IDF vector; and categorizing the log event into one or more log categories based on a comparison of the distance between the TF-IDF vector and the closest cluster centroid with a pre-determined silhouette threshold corresponding to the cluster with the closest cluster centroid.
-
Specification