Interactive visualization of big data sets and models including textual data

US 9,501,540 B2
Filed: 09/25/2014
Issued: 11/22/2016
Est. Priority Date: 11/04/2011
Status: Active Grant

Abstract

Systems and processes are disclosed for advanced text analysis in the field of big data analytics and visualization: Users can now factor text into their predictive models, alongside regression, time/date and categorical information. This is ideal for building models where text content may play a prominent role (e.g., social media or customer service logs). Multiple data types, including text fields, may be combined together in datasets and models, and may be presented in various interactive visualization displays.

18 Claims

1. A method comprising:
- accessing a set of sample data instances, each instance comprising a corresponding value for at least some of a plurality of data fields and at least one of the data fields characterized as a text data type;
  
  processing the sample data instances so as to form a dataset, the processing including analyzing the sample data instances to recognize a data type for each of the plurality of data fields, the recognition including selecting the data type from a predetermined set of data types that includes at least a numeric data type, a categorical data type, and a text data type;
  
  generating a visual summary of the dataset on a computing device, the visual summary comprising a tabular presentation including a series of rows and columns of information, each row corresponding to one of the data fields of the sample data, and each column displaying a corresponding parameter in each of the rows, wherein the displayed column parameters include a data field name, a type of the data field named in the row, and a count of sample data instances in the data set that include a value in the named field;
  
  in response to recognizing a text data type for one of the data fields of a sample data instance, matching the values of the text data field to a human language;
  
  based on the matched human language, tokenizing a value of each text data field to form a corresponding token;
  
  incorporating the corresponding token as a new value for the corresponding text data field in the dataset; and
  
  displaying parameters of the text data field in a corresponding row of the visual summary;
  
  wherein processing the sample data further includes applying a selected tokenization process to form a set of tokens based on the values in the text data fields;
  
  for a given row in the visual summary corresponding to a text data field in the sample data set, tokenizing all of the respective values of the text data field found in the sample data set to form a set of tokens for the given row;
  
  counting a respective number of occurrences of each one of the tokens; and
  
  storing the counted numbers of occurrences.
- - 2. The method of claim 1 wherein:
    - at least one of the columns displays, in each row, an indication of a corresponding number of instances of the sample data that include the corresponding data field;
      
      at least one of the columns displays, in each row, a corresponding number of instances of the sample data that are missing the corresponding data field; and
      
      at least one of the columns displays, in each row, a corresponding number of instances of the sample data that have an error in the corresponding data field.
  - 3. The method of claim 1:
    - wherein at least one of the columns displays, in each row, a corresponding histogram of the values of the corresponding data field in the instances of the sample data.
  - 4. The method of claim 1 and further comprising:
    - generating a tag cloud for a selected text field, and displaying the tag cloud on an electronic display, wherein the tag cloud displays a plurality of the corresponding text field; and
      
      wherein the tokens are displayed in the tag cloud in font sizes that are selected in proportion to the relative frequency of occurrence of each token in the selected text field.
  - 5. The method of claim 4 and further comprising:
    - responsive to receiving an input selection of a word in the tag cloud display, further displaying a number of occurrences of the selected word in a popup overlying or adjacent to the tag cloud display.
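
As a rough illustration of the steps recited in claims 1, 2 and 4, the following is a minimal Python sketch, not the patented implementation. It builds one summary row per data field (name, recognized type, count, missing), counts token occurrences for text fields, and derives tag cloud font sizes in proportion to token frequency. The function names, the type heuristics, the regular-expression tokenizer and the sample records are all assumptions made for illustration; human-language matching and the error-count column are left out.

```python
from collections import Counter
import re


def infer_type(values):
    """Pick 'numeric', 'categorical' or 'text' for a field (assumed heuristic)."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "categorical"
    try:
        for v in present:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        pass
    # Assumption: a value containing spaces is treated as free text.
    if any(" " in str(v) for v in present):
        return "text"
    return "categorical"


def tokenize(value):
    """Hypothetical tokenizer: lowercase word characters only."""
    return re.findall(r"[a-z0-9']+", value.lower())


def summarize(instances):
    """Return one row per field: name, type, count, missing, token counts."""
    fields = []
    for inst in instances:
        for name in inst:
            if name not in fields:
                fields.append(name)
    rows = []
    for name in fields:
        values = [inst.get(name) for inst in instances]
        present = [v for v in values if v not in (None, "")]
        row = {
            "name": name,
            "type": infer_type(values),
            "count": len(present),
            "missing": len(values) - len(present),
        }
        if row["type"] == "text":
            # Tokenize every value of the field and store the occurrence counts.
            row["token_counts"] = Counter(t for v in present for t in tokenize(v))
        rows.append(row)
    return rows


def tag_cloud_sizes(token_counts, max_pt=40):
    """Font size proportional to each token's frequency relative to the top token."""
    top = max(token_counts.values())
    return {tok: round(max_pt * n / top) for tok, n in token_counts.items()}


# Made-up sample data instances.
instances = [
    {"id": "1", "channel": "email", "comment": "late delivery again"},
    {"id": "2", "channel": "phone", "comment": "delivery was fast"},
    {"id": "3", "channel": "email", "comment": ""},
]
rows = summarize(instances)
sizes = tag_cloud_sizes(rows[2]["token_counts"])
```

In this sketch the `comment` row reports two populated instances and one missing, and `delivery`, the most frequent token, receives the largest font size.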

6. A method comprising:
- accessing a digital source data file comprising a plurality of records, each record comprising at least one data field;
  
  processing the source data file on a computing device to recognize a data type for each of the data fields;
  
  in response to recognizing a text data type for a particular data field, matching the text of the particular data field to a human language and applying a stemming process corresponding to the matched human language, thereby tokenizing the text to form a corresponding token;
  
  forming a dataset based on the source data file, said forming step including substituting the corresponding token into the dataset in place of each of the tokenized text fields;
  
  displaying an interactive summary of the dataset on a display screen of the computing device;
  
  building a model based at least in part on the dataset;
  
  receiving an indication of a type of visualization to be displayed;
  
  generating a space-filling graphical representation of the model on a computing device, the space-filling graphical representation comprising a plurality of segments arranged to realize the indicated type of visualization; and
  
  displaying the space-filling graphical representation of the model on a display screen of the computing device; and
  
  further displaying a legend adjacent to the space-filling representation of the model.
- - 7. The method of claim 6, further comprising:
    - providing a color scheme based at least in part on the indicated type of visualization; and
      
      displaying the graphical representation of the model on the display screen using the color scheme.
  - 8. The method of claim 6, wherein the interactive summary display comprises a tabular presentation, the presentation including a series of rows and columns of information, each row corresponding to a respective one of the data fields of the dataset;
    - and further including displaying a corresponding data type indicator in each row, the data type indicator selected from a set of indicators that includes a first indicator for a text data type and a second indicator, distinct from the first indicator, for a categorical data type.
  - 9. The method of claim 8, wherein the tabular presentation includes:
    - a first column listing a name of the corresponding field in each row;
      
      a second column listing the corresponding data type indicator in each row; and
      
      a third column listing, in each row, a corresponding number of instances of data that have content in the corresponding data field.
  - 10. The method of claim 9, wherein the tabular presentation includes a fourth column in which a respective histogram is displayed in each row that corresponds to a text field, the histogram presenting in graphical form an indication of a relative number of instances of each token of the corresponding field.
  - 11. The method of claim 10, wherein the histogram comprises a series of vertical bars, wherein each bar represents one of the tokens in the corresponding text field, and the relative height of each bar provides a graphic indication of a relative number of instances of the corresponding token.
  - 12. The method of claim 11, wherein the presentation implements an interactive feature that, responsive to user selection of one of the bars of a histogram, automatically displays a pop-up panel that shows the corresponding token represented by the selected bar, and the number of occurrences of the corresponding token.
  - 13. The method of claim 12, wherein the presentation includes, adjacent to a text field histogram, a pop-up user control for scrolling the histogram to present additional bars in the display.
  - 14. The method of claim 9, and further comprising, for a selected row of the presentation that corresponds to a text data type field, generating and displaying a tag cloud representation of the corresponding text data.
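
The language matching, stemming and per-token histogram recited in claims 6 and 10 to 12 can be sketched as follows. This is an illustrative toy under stated assumptions, not the patented process: the stopword-overlap language match, the suffix-stripping stemmer, the tiny word lists and the helper names are all hypothetical stand-ins for whatever language detection and stemming a real system would use.

```python
from collections import Counter
import re

# Assumed, deliberately tiny resources for two languages.
STOPWORDS = {
    "en": {"the", "and", "is", "was", "of"},
    "es": {"el", "la", "y", "es", "de"},
}
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "es": ("ando", "os", "as", "s"),
}


def words_of(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[^\W\d_]+", text.lower())


def match_language(text):
    """Match text to the language whose stopwords it shares most (toy heuristic)."""
    words = set(words_of(text))
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def stem(word, lang):
    """Strip the first matching suffix for the matched language (toy stemmer)."""
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize_field(values):
    """Return the matched language and occurrence counts of stemmed tokens."""
    lang = match_language(" ".join(values))
    counts = Counter()
    for value in values:
        for word in words_of(value):
            if word not in STOPWORDS[lang]:
                counts[stem(word, lang)] += 1
    return lang, counts


def histogram_bars(counts, height=8):
    """One (token, count, bar_height) triple per token, tallest bar first."""
    top = max(counts.values())
    return [
        (tok, n, max(1, round(height * n / top)))
        for tok, n in counts.most_common()
    ]


# Made-up values of a single text field.
values = [
    "the shipping was delayed",
    "shipped and delayed again",
    "the delay is annoying",
]
lang, counts = tokenize_field(values)
bars = histogram_bars(counts)
```

Each triple carries what the pop-up panel of claim 12 would show for a selected bar: the token and its number of occurrences. The bar height is relative to the most frequent token.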

15. A visualization method comprising:
- accessing a data model based at least in part on a dataset comprising data items, wherein at least one of the data items includes a text data field;
  
  in response to recognizing a text data type for one of the data fields of a sample data instance, matching the values of the text data field to a human language;
  
  based on the matched human language, tokenizing a value of each text data field to form a corresponding token;
  
  incorporating the corresponding token as a new value for the corresponding text data field in the dataset; and
  
  displaying parameters of the text data field in a corresponding row of a visual summary of the dataset;
  
  generating a decision tree representation of the data model, wherein the decision tree comprises nodes and branches, wherein at least one of the nodes represents a split based on the content of a text field;
  
  displaying at least a selected portion of the decision tree on an electronic display screen;
  
  highlighting a selected prediction path in the displayed portion of the decision tree; and
  
  displaying a legend along with the displayed portion of the decision tree, the legend indicating each split criteria along the selected prediction path;
  
  wherein, for each text field that appears in the legend, a pop-up panel is provisioned to display additional information, responsive to selection of a given text field in the legend, the additional information including a token used to determine the corresponding split in the prediction path.
- - 16. The method of claim 15 wherein the additional information further includes a number of occurrences or count of the token in the dataset.
  - 17. The method of claim 15 wherein the legend identifies the split criteria for each text field in the selected prediction path by displaying a specific token and whether or not the corresponding text field contains the specific token.
  - 18. The method of claim 15 wherein the legend display includes one display element that compresses multiple selection criteria for a given text field into a single display element, the single display element listing, for the given text field, all of the tokens that the text field contains, and all of the tokens that the text field does not contain.
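
The legend behavior of claims 17 and 18 can be illustrated with a short sketch. It assumes, hypothetically, that each text split along the selected prediction path is available as a `(field, token, contains)` triple; it then compresses all criteria for the same text field into a single legend element listing the tokens the field contains and the tokens it does not contain. The data layout, function names and rendered format are assumptions, and numeric or categorical splits are ignored.

```python
def compress_legend(path):
    """Group text split criteria by field.

    path: list of (field, token, contains) triples along a prediction path.
    """
    legend = {}
    for field, token, contains in path:
        entry = legend.setdefault(field, {"contains": [], "does_not_contain": []})
        key = "contains" if contains else "does_not_contain"
        if token not in entry[key]:
            entry[key].append(token)
    return legend


def render(legend):
    """Render one legend line per text field (assumed display format)."""
    lines = []
    for field, entry in legend.items():
        parts = []
        if entry["contains"]:
            parts.append("contains: " + ", ".join(entry["contains"]))
        if entry["does_not_contain"]:
            parts.append("does not contain: " + ", ".join(entry["does_not_contain"]))
        lines.append(field + " [" + "; ".join(parts) + "]")
    return lines


# Made-up prediction path with three splits on "comment" and one on "subject".
path = [
    ("comment", "refund", True),
    ("comment", "thanks", False),
    ("subject", "urgent", True),
    ("comment", "late", True),
]
legend_lines = render(compress_legend(path))
```

Here the three separate `comment` splits collapse into one legend element, while `subject` keeps its own.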

Current Assignee: BigML, Inc.
Original Assignee: BigML, Inc.
Inventors: Parker, Charles; Ashenfelter, Adam
Primary Examiner(s): Gofman, Alex
Assistant Examiner(s): Mian, Umar

Application Number: US14/497,102
Publication Number: US 20150019569A1
Time in Patent Office: 789 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/156   Query results presentation
G06F 16/26   Visual data mining; Browsin...
G06F 16/322   Trees
G06F 16/3344   using natural language anal...
G06F 16/338   Presentation of query results
