METHOD AND SYSTEM FOR CLUSTERING, MODELING, AND VISUALIZING PROCESS MODELS FROM NOISY LOGS

US 20150142707A1
Filed: 11/15/2013
Published: 05/21/2015
Est. Priority Date: 11/15/2013
Status: Active Grant

Abstract

A process discovery system that includes an offline system training module configured to cluster similar process log traces using Non-negative Matrix Factorization (NMF) with each cluster representing a process model, and learn a Conditional Random Field (CRF) model for each process model and an online system usage module configured to decode new incoming log traces and construct a process graph in which transitions are shown or hidden according to a tuning parameter.
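
The clustering step mentioned in the abstract can be illustrated with a small sketch. The listing does not say which NMF update rule is used, so the multiplicative updates below are an assumption, as are the function names `nmf` and `assign_clusters` and the toy matrix. The sketch factors a term-document matrix V (terms by traces) into a term-cluster matrix W and a cluster-document matrix H, then assigns each trace to the cluster with the largest weight in its column of H.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (m x n) into w (m x k) and h (k x n).

    Assumption: standard multiplicative updates minimising squared error;
    the patent listing does not specify the update rule.
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return w, h


def assign_clusters(h):
    """Assign each document (column of h) to its highest-weight cluster."""
    return [max(range(len(h)), key=lambda c: h[c][j]) for j in range(len(h[0]))]


# Hypothetical toy matrix: rows are terms, columns are traces.
# Traces 0-1 share one vocabulary, traces 2-3 share another.
v = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
w, h = nmf(v, k=2)
labels = assign_clusters(h)
print(labels)
```

In practice the cells of V would hold TF-IDF scores rather than raw 0/1 values, and the number of clusters k would have to be chosen to match the expected number of process models.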

19 Claims

1. A computer-implemented process discovery method, comprising:
- receiving as input at least one noisy log file that contains a plurality of labeled log traces from a plurality of process models;
  
  clustering similar log traces using non-negative matrix factorization (NMF) into a plurality of clusters, wherein each cluster represents a different process model;
  
  learning a Conditional Random Field (CRF) model for each of the process models;
  
  decoding new incoming log traces; and
  
  constructing a tunable process graph, wherein one or more transitions are shown or hidden according to a tuning parameter.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The computer-implemented process discovery method of claim 1, wherein clustering similar log traces further comprises decomposing a term-document matrix into at least a term-cluster matrix and a cluster-document matrix.
  - 3. The computer-implemented process discovery method of claim 1, wherein the CRF learns to classify a sequence of activities that comprise a process model by associating an activity entry in a log trace to an activity label at least according to one or more features and a previous activity.
  - 4. The computer-implemented process discovery method of claim 1, wherein learning a CRF model for each of the process models further comprises:
    - associating a TF-IDF vector for at least one cluster and for the entries in a log trace by assigning a label to each activity log entry according to a reference annotation, wherein one or more features of the vector comprise one or more words occurring in the entry, and for each feature computing a TF-IDF score by taking into account substantially all the activity log entries in the cluster only, and adding a Boolean feature such as the name of the previous activity;
      
      generating one or more feature matrices; and
      
      training a CRF for each feature matrix.
  - 5. The computer-implemented process discovery method of claim 1, further comprising providing a visualization of discovered process models by transforming a probabilistic activity transition matrix into a footprint matrix directly usable by an α+-algorithm.
  - 6. The computer-implemented process discovery method of claim 1, wherein the tunable process graph comprises a visual representation of discovered process models associated with the learned CRFs and includes at least a plurality of nodes representing activities, a plurality of arrows representing transitions, and one or more “OR” or “AND” gateways.
  - 7. The computer-implemented process discovery method of claim 1, wherein decoding new incoming log traces further comprises:
    - submitting the incoming log traces into the learned CRFs to obtain a matching probability and a decoding of the incoming log traces, wherein the incoming log traces include activity log entries;
      
      the CRFs classifying a sequence of feature vectors that correspond to a sequence of activities in the incoming traces;
      
      the CRFs labeling each activity log entry with an activity name and assigning a particular likelihood score to each of the sequences of activities according to the learned models;
      
      ranking likelihood scores calculated by each CRF;
      
      generating as output the process model that generated the trace and the activity names corresponding to each activity entry in the trace.
  - 8. The computer-implemented process discovery method of claim 1, wherein the tunable process graph is tuned using a [0,1] parameter that controls the level of transition rates, wherein when the parameter is close to 1, highly probable transitions are shown to the user and when the parameter is close to 0, transitions with low probabilities are visible.
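
The tuning behaviour recited in claim 8 can be sketched as a simple filter. This is only one possible reading: it assumes the [0,1] parameter acts as a lower bound on transition probability, so that a value near 1 leaves only highly probable transitions and a value near 0 also reveals low-probability ones. The function name `visible_transitions` and the sample activities are hypothetical.

```python
def visible_transitions(transition_probs, tuning):
    """Return the transitions to draw for a given tuning parameter.

    Assumption: a transition is shown when its probability is at least
    `tuning`; the claim does not spell out the exact rule.
    """
    if not 0.0 <= tuning <= 1.0:
        raise ValueError("tuning parameter must lie in [0, 1]")
    return {edge: p for edge, p in transition_probs.items() if p >= tuning}


# Hypothetical transition probabilities between activities.
probs = {
    ("receive", "review"): 0.9,
    ("review", "approve"): 0.6,
    ("review", "reject"): 0.3,
    ("approve", "review"): 0.05,
}

strict = visible_transitions(probs, 0.8)   # only the dominant path
relaxed = visible_transitions(probs, 0.0)  # every observed transition
print(sorted(strict))
print(len(relaxed))
```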

9. A process discovery system comprising:
- an offline system training module configured to receive as input at least one noisy log file that contains a plurality of labeled log traces from a plurality of process models, cluster similar log traces using Non-negative Matrix Factorization (NMF) with each cluster representing a different process model, and learn a Conditional Random Field (CRF) model for each process model;
  
  an online system usage module configured to decode new incoming log traces and to construct a tunable process graph in which transitions are shown or hidden according to a tuning parameter.
- View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
  - 10. The process discovery system of claim 9, wherein the offline system training module is further configured to cluster similar log traces by decomposing a term-document matrix into at least a term-cluster matrix and a cluster-document matrix.
  - 11. The process discovery system of claim 9, wherein the CRF learns to classify a sequence of activities that comprise a process model by associating an activity entry in a log trace to an activity label at least according to one or more features and a previous activity.
  - 12. The process discovery system of claim 9, wherein the offline system training module is further configured to learn a CRF model for each of the process models by:
    - associating a TF-IDF vector for at least one cluster and for the entries in a log trace by assigning a label to each activity log entry according to a reference annotation, wherein one or more features of the vector comprise one or more words occurring in the entry, and for each feature computing a TF-IDF score by taking into account substantially all the activity log entries in the cluster only, and adding a Boolean feature such as the name of the previous activity;
      
      generating one or more feature matrices; and
      
      training a CRF for each feature matrix.
  - 13. The process discovery system of claim 9, wherein the online usage system module is further configured to provide a visualization of discovered process models by transforming a probabilistic activity transition matrix into a footprint matrix directly usable by an α+-algorithm.
  - 14. The process discovery system of claim 9, wherein the tunable process graph comprises a visual representation of discovered process models associated with the learned CRFs and includes at least a plurality of nodes representing activities, a plurality of arrows representing transitions, and one or more “OR” or “AND” gateways.
  - 15. The process discovery system of claim 9, wherein the online system usage module is further configured to decode new incoming log traces by:
    - submitting the incoming log traces into the learned CRFs to obtain a matching probability and a decoding of the incoming log traces, wherein the incoming log traces include activity log entries;
      
      the CRFs classifying a sequence of feature vectors that correspond to a sequence of activities in the incoming traces;
      
      the CRFs labeling each activity log entry with an activity name and assigning a particular likelihood score to each of the sequences of activities according to the learned models;
      
      ranking likelihood scores calculated by each CRF;
      
      generating as output the process model that generated the trace and the activity names corresponding to each activity entry in the trace.
  - 16. The process discovery system of claim 9, wherein the tunable process graph is tuned using a [0,1] parameter that controls the level of transition rates, wherein when the parameter is close to 1, highly probable transitions are shown to the user and when the parameter is close to 0, transitions with low probabilities are visible.
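
The TF-IDF scoring that claims 4 and 12 rely on can be sketched as follows. The claims do not fix a weighting formula, so this sketch assumes term frequency normalised by entry length and an unsmoothed logarithmic inverse document frequency; the function name `tfidf_matrix`, the whitespace tokenisation, and the sample log entries are all hypothetical. The result is a term-document matrix in which each cell holds the TF-IDF score of one term in one entry, together with the IDF vector for the vocabulary.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Build a term-document TF-IDF matrix (rows: terms, columns: documents).

    Assumptions: whitespace tokenisation, tf = count / document length,
    idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in vocabulary}
    matrix = []
    for term in vocabulary:
        row = []
        for tokens in tokenized:
            tf = tokens.count(term) / len(tokens) if tokens else 0.0
            row.append(tf * idf[term])
        matrix.append(row)
    return vocabulary, idf, matrix


# Hypothetical activity log entries.
entries = [
    "open ticket assign agent",
    "open ticket close ticket",
    "print invoice send invoice",
]
vocabulary, idf, matrix = tfidf_matrix(entries)
print(vocabulary)
```

A Boolean feature for the previous activity, as the claims describe, could be appended to each column as an extra row per activity name.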

17. A computer-implemented process discovery method comprising:
- receiving as input at least one noisy log file that contains a plurality of labeled trace activity log entries from a plurality of process models, wherein each trace in the log comprises a document;
  
  calculating a term frequency-inverse document frequency (TF-IDF) vector score for each document in the log file, wherein words appearing in the document comprise the features of a vector for which the TF-IDF vector score is calculated;
  
  obtaining a term-document matrix, wherein each cell contains the TF-IDF score of a given term in a given document;
  
  applying non-negative matrix factorization (NMF) to cluster similar documents;
  
  obtaining a plurality of clusters of noisy process documents via NMF, wherein each cluster contains the documents of different instances of the same process model;
  
  for each cluster and for each activity log entry in a document, associating a TF-IDF vector is performed as follows:
  
  a label for each activity log entry is assigned according to a reference annotation;
  
  the features of the vector are words occurring in the entry;
  
  for each feature, a TF-IDF score is computed by taking into account all the entries in this cluster only;
  
  a Boolean feature comprising the name of the previous activity is added;
  
  computing feature matrices, wherein the feature matrices comprise term-document matrices in which each document is a trace activity entry and is augmented with at least one Boolean feature that represents the previous activity;
  
  training a conditional random field (CRF);
  
  obtaining as output a plurality of CRFs, wherein each CRF is configured to model one or more transition probabilities between activities of one process model;
  
  storing a plurality of inverse document frequency (IDF) vectors of terms, wherein each vector is the size of a feature vocabulary for a given cluster.
- View Dependent Claims (18, 19)
  - 18. The computer-implemented process discovery method of claim 17, further comprising:
    - receiving as input at least one noisy log file including a plurality of labeled trace activity log entries generated by an unknown process;
      
      for each CRF transforming each activity entry of the incoming trace into a feature vector substantially identical to the vector used for training the CRF;
      
      classifying the sequence of feature vectors that correspond to the sequence of activities in the incoming trace and labeling each activity entry with an activity name and assigning a particular likelihood to this sequence of activities according to its learned model;
      
      ranking the likelihood scores calculated by each CRF, wherein the highest likelihood reflects the right classification for the trace;
      
      obtaining as output the process model that generated the trace.
  - 19. The computer-implemented process discovery method of claim 18, further comprising:
    - receiving as input the learned CRF models;
      
      extracting from at least one CRF model, a three dimensional (3D) activity transition matrix that models transition probabilities from a first activity X to a second activity Y given a previous activity Z;
      
      reducing the 3D activity transition matrix to a two dimensional (2D) transition matrix by marginalizing it on Z;
      
      transforming the probability matrix into a footprint matrix as defined by an α+-algorithm as follows:
      
      If P(a,b) ≥ H and P(b,a) ≤ L → a > b (denotes that a precedes b)
      
      If P(a,b) ≤ L and P(b,a) ≥ H → a < b (denotes that b precedes a)
      
      If P(a,b) ≥ H and P(b,a) ≥ L → a ∥ b (denotes that a and b are parallel)
      
      If P(a,b) ≤ L and P(b,a) ≤ L → a # b (denotes that a and b never or rarely transition from one to the other)
      
      where H and L are respectively high and low transition probability thresholds that can be given by the user;
      
      applying the α+-algorithm to construct from this matrix a process graph;
      
      receiving as output a process graph that is tunable with a tuning parameter using the transition probabilities kept in the 2D transition matrices.
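
The matrix reduction and thresholding steps of claim 19 can be sketched in a few lines. Several details are assumptions: the 3D matrix is indexed as `p3[z][x][y]`, marginalising on Z is taken to be a weighted sum over Z with uniform weights by default, the four rules are tested in the order the claim lists them, and pairs that match none of the rules are left as `None`. The function names and sample values are hypothetical.

```python
def marginalize_on_z(p3, z_weights=None):
    """Reduce p3[z][x][y] to a 2D matrix p2[x][y] by summing over z.

    Assumption: each slice z is weighted by z_weights[z] (uniform if omitted).
    """
    n_z = len(p3)
    if z_weights is None:
        z_weights = [1.0 / n_z] * n_z
    n = len(p3[0])
    return [[sum(z_weights[z] * p3[z][x][y] for z in range(n_z))
             for y in range(n)] for x in range(n)]


def footprint(p2, high, low):
    """Map a 2D transition matrix to footprint relations using thresholds H, L."""
    n = len(p2)
    matrix = [[None] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            ab, ba = p2[a][b], p2[b][a]
            if ab >= high and ba <= low:
                matrix[a][b] = ">"    # a precedes b
            elif ab <= low and ba >= high:
                matrix[a][b] = "<"    # b precedes a
            elif ab >= high and ba >= low:
                matrix[a][b] = "||"   # a and b are parallel
            elif ab <= low and ba <= low:
                matrix[a][b] = "#"    # never or rarely transition
            # Otherwise no rule in the claim applies; the cell stays None.
    return matrix


# Hypothetical 3D matrix for three activities and two previous-activity slices.
p3 = [
    [[0.0, 1.0, 0.1], [0.0, 0.0, 0.9], [0.0, 0.6, 0.0]],
    [[0.0, 0.8, 0.0], [0.04, 0.0, 0.7], [0.02, 0.8, 0.0]],
]
p2 = marginalize_on_z(p3)
fp = footprint(p2, high=0.6, low=0.1)
for row in fp:
    print(row)
```

The resulting footprint matrix is the input an α+-style miner would consume; the 2D probabilities in `p2` are kept alongside it so the graph can later be tuned.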

Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Charif, Yasmine; Bourdaillet, Julien Jean Lucien; Vandervort, David Russell; Kehoe, Michael P.; Kataria, Saurabh

Granted Patent

US 9,324,038 B2
US Class Current

706/12
CPC Class Codes

G06N 20/00 Machine learning

G06Q 10/0639 Performance analysis of emp...
