Managing selection of a representative data subset according to user-specified parameters with clustering

US 10,585,910 B1
Filed: 01/31/2017
Issued: 03/10/2020
Est. Priority Date: 01/22/2013
Status: Active Grant

Abstract

Embodiments are directed towards generating a representative sampling as a subset from a larger dataset that includes unstructured data. A graphical user interface enables a user to provide various data selection parameters, including specifying a data source and one or more desired subset types, such as latest records, earliest records, diverse records, outlier records, and/or random records. Diverse and/or outlier subset types may be obtained by generating clusters from an initial selection of records obtained from the larger dataset. An iteration analysis is performed to determine whether the number of clusters and/or cluster types generated exceeds at least one threshold; when the threshold is not exceeded, additional clustering is performed on additional records. From the resultant clusters, and/or other subtype results, a subset of records is obtained as the representative sampling subset.
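
The iteration analysis described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the digit-masking `signature` used as the similarity key, the function names, the starting batch size, and the doubling growth factor. It clusters an initial selection of records, checks whether the cluster count exceeds a threshold, and otherwise clusters a larger selection.

```python
import re
from collections import defaultdict


def signature(raw):
    # Assumed similarity key: mask digit runs so records that differ
    # only in numbers fall into the same cluster.
    return re.sub(r"\d+", "N", raw)


def cluster(records):
    # Group records whose signatures are identical.
    clusters = defaultdict(list)
    for record in records:
        clusters[signature(record)].append(record)
    return dict(clusters)


def cluster_until_sufficient(records, threshold, initial_size=4, growth=2):
    # Cluster a growing prefix of `records` until the number of clusters
    # exceeds `threshold` or every record has been clustered.
    size = initial_size
    while True:
        clusters = cluster(records[:size])
        if len(clusters) > threshold or size >= len(records):
            return clusters, min(size, len(records))
        size *= growth


# Hypothetical sample data, not from the patent.
records = [
    "login ok user=1",
    "login ok user=2",
    "login ok user=3",
    "login ok user=4",
    "disk full on /dev/sda1",
    "login ok user=5",
    "timeout after 30 ms",
    "login ok user=6",
]
clusters, used = cluster_until_sufficient(records, threshold=2)
```

With this sample data the first four records form a single cluster, so the sketch widens the selection to all eight records before the threshold is exceeded.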

30 Claims

1. A computer implemented method for managing selection of a representative data subset, comprising:
- receiving, from a user via a graphical user interface, selections of:
  
  (i) a data source type from which to generate the representative data subset,
  
  (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and
  
  (iii) a number of desired representative events to be included in the subset;
  
  retrieving events from the selected data source according to the received selection of subset type;
  
  clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group;
  
  extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and
  
  causing display of the subset of representative events in the graphical user interface.
- - 2. The method of claim 1, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
  - 3. The method of claim 1, wherein extracting further comprises selecting events from one or more populous clusters.
  - 4. The method of claim 1, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
  - 5. The method of claim 1, wherein clustering further comprises:
    - clustering a group of events in the plurality of events to form a plurality of clusters;
      
      determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and
      
      clustering a larger group of events in the plurality of events than the group of events.
  - 6. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp.
  - 7. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
  - 8. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
  - 9. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
  - 10. The method of claim 1, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.
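
As a rough illustration of how the subset types named in claims 3, 4, and 8 to 10 might pick events, the following Python sketch selects a user-specified number of events per subset type. It is a sketch under stated assumptions and not the claimed implementation: the digit-masking cluster key, the type names `earliest`, `latest`, `random`, `diverse`, and `outlier`, the `seed` parameter, and the rule of taking one event per cluster are all hypothetical.

```python
import random
import re
from collections import defaultdict


def cluster(events):
    # Stand-in clustering: group events by a crude signature (digit runs masked).
    groups = defaultdict(list)
    for event in events:
        groups[re.sub(r"\d+", "N", event)].append(event)
    # Most populous cluster first; ties keep first-seen order.
    return sorted(groups.values(), key=len, reverse=True)


def select_subset(events, subset_type, count, seed=0):
    # Return up to `count` events; `events` is assumed to be ordered oldest-first.
    if subset_type == "earliest":
        return events[:count]
    if subset_type == "latest":
        return events[-count:] if count else []
    if subset_type == "random":
        return random.Random(seed).sample(events, min(count, len(events)))
    clusters = cluster(events)
    if subset_type == "diverse":
        # One representative from each of the most populous clusters.
        return [c[0] for c in clusters[:count]]
    if subset_type == "outlier":
        # One representative from each of the least populous clusters.
        return [c[0] for c in reversed(clusters)][:count]
    raise ValueError(f"unknown subset type: {subset_type}")


# Hypothetical sample events, not from the patent.
events = [
    "login ok user=1",
    "login ok user=2",
    "disk full on /dev/sda1",
    "login ok user=3",
    "timeout after 30 ms",
    "timeout after 45 ms",
]
```

A combination of subset types, as claim 1 allows, could be served by calling `select_subset` once per selected type and merging the results.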

11. A non-transitory, computer-readable storage medium storing instructions, an execution of which in a computer system causes the computer system to perform operations comprising:
- receiving, from a user via a graphical user interface, selections of:
  
  (i) a data source type from which to generate the representative data subset,
  
  (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and
  
  (iii) a number of desired representative events to be included in the subset;
  
  retrieving events from the selected data source according to the received selection of subset type;
  
  clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group;
  
  extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and
  
  causing display of the subset of representative events in the graphical user interface.
- - 12. The computer-readable storage medium of claim 11, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
  - 13. The computer-readable storage medium of claim 11, wherein extracting further comprises selecting events from one or more populous clusters.
  - 14. The computer-readable storage medium of claim 11, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
  - 15. The computer-readable storage medium of claim 11, wherein clustering further comprises:
    - clustering a group of events in the plurality of events to form a plurality of clusters;
      
      determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and
      
      clustering a larger group of events in the plurality of events than the group of events.
  - 16. The computer-readable storage medium of claim 11, wherein each event in the plurality of events is associated with a time stamp.
  - 17. The computer-readable storage medium of claim 11, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
  - 18. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
  - 19. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
  - 20. The computer-readable storage medium of claim 11, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.

21. A computer system comprising:
- computer memory for storing machine data; and
  
  a processor for:
  
  receiving, from a user via a graphical user interface, selections of:
  
  (i) a data source type from which to generate the representative data subset,
  
  (ii) one or a combination of subset types, of a plurality of defined event subset types, for identifying events to include in the subset, and
  
  (iii) a number of desired representative events to be included in the subset;
  
  retrieving events from the selected data source according to the received selection of subset type;
  
  clustering to identify similarities between the retrieved events to determine whether the particular events can be characterized as forming a group;
  
  extracting from the retrieved, clustered events a number of events corresponding to the user-selected number of desired representative events, wherein the events are extracted based on a field-extraction rule that specifies how to extract values from raw machine data included in each of the one or more events; and
  
  causing display of the subset of representative events in the graphical user interface.
- - 22. The computer system of claim 21, wherein clustering further comprises placing events into a same cluster based on similarities in the machine data in each of the events.
  - 23. The computer system of claim 21, wherein extracting further comprises selecting events from one or more populous clusters.
  - 24. The computer system of claim 21, wherein the plurality of defined event subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
  - 25. The computer system of claim 21, wherein clustering further comprises:
    - clustering a group of events in the plurality of events to form a plurality of clusters;
      
      determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and
      
      clustering a larger group of events in the plurality of events than the group of events.
  - 26. The computer system of claim 21, wherein each event in the plurality of events is associated with a time stamp.
  - 27. The computer system of claim 21, wherein each event in the plurality of events is associated with a time stamp that has been extracted from the portion of raw machine data in that event.
  - 28. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify outlier events.
  - 29. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with earliest events in the plurality of events.
  - 30. The computer system of claim 21, wherein retrieving events from the selected data source according to the received selection of subset types includes using a process to identify events associated with latest events in the plurality of events.
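
The independent claims refer to a field-extraction rule that specifies how to extract values from raw machine data, and claims 7, 17, and 27 refer to a time stamp extracted from that raw data. One plausible reading, shown here only as a hedged sketch, treats the rule as a regular expression whose named groups become field names. The log layout, the `RULE` pattern, and the `extract_fields` helper are hypothetical and are not specified by the claims.

```python
import re
from datetime import datetime

# Hypothetical field-extraction rule: named groups become field names.
RULE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def extract_fields(raw, rule=RULE):
    # Apply the rule to one event's raw text; return {} when it does not match.
    match = rule.match(raw)
    if match is None:
        return {}
    fields = match.groupdict()
    if "timestamp" in fields:
        # Convert the extracted text into a time stamp associated with the event.
        fields["timestamp"] = datetime.strptime(
            fields["timestamp"], "%Y-%m-%dT%H:%M:%S"
        )
    return fields


# Hypothetical raw event, not from the patent.
fields = extract_fields("2017-01-31T08:15:00 ERROR disk full on /dev/sda1")
```

Returning an empty dictionary for non-matching events keeps the rule optional per event, which suits unstructured data where not every record follows the same layout.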

Current Assignee: Splunk Inc. (Cisco Systems, Inc.)
Original Assignee: Splunk Inc. (Cisco Systems, Inc.)
Inventors: Carasso, R. David; Delfino, Micah James
Primary Examiner(s): Ly, Anh
Application Number: US15/421,406
Time in Patent Office: 1,134 Days

CPC Class Codes

G06F 16/254   Extract, transform and load...

G06F 16/287   Visualization; Browsing

G06F 16/35   Clustering; Classification

G06F 16/904   Browsing; Visualisation the...

G06F 3/0482   Interaction with lists of s...

G06F 3/04842   Selection of displayed obje...

G06F 3/0488   using a touch-screen or dig...

G06F 7/24   Sorting, i.e. extracting da...
