Interactive visualization of big data sets and models including textual data
Abstract
Systems and processes are disclosed for advanced text analysis in the field of big data analytics and visualization. Users can factor text into their predictive models alongside regression, time/date, and categorical information. This is well suited to building models where text content may play a prominent role (e.g., social media or customer service logs). Multiple data types, including text fields, may be combined in datasets and models, and may be presented in various interactive visualization displays.
18 Claims
1. A method comprising:
accessing a set of sample data instances, each instance comprising a corresponding value for at least some of a plurality of data fields and at least one of the data fields characterized as a text data type;
processing the sample data instances so as to form a dataset, the processing including analyzing the sample data instances to recognize a data type for each of the plurality of data fields, the recognition including selecting the data type from a predetermined set of data types that includes at least a numeric data type, a categorical data type, and a text data type;
generating a visual summary of the dataset on a computing device, the visual summary comprising a tabular presentation including a series of rows and columns of information, each row corresponding to one of the data fields of the sample data, and each column displaying a corresponding parameter in each of the rows, wherein the displayed column parameters include a data field name, a type of the data field named in the row, and a count of sample data instances in the data set that include a value in the named field;
in response to recognizing a text data type for one of the data fields of a sample data instance, matching the values of the text data field to a human language;
based on the matched human language, tokenizing a value of each text data field to form a corresponding token;
incorporating the corresponding token as a new value for the corresponding text data field in the dataset; and
displaying parameters of the text data field in a corresponding row of the visual summary;
wherein processing the sample data further includes applying a selected tokenization process to form a set of tokens based on the values in the text data fields;
for a given row in the visual summary corresponding to a text data field in the sample data set, tokenizing all of the respective values of the text data field found in the sample data set to form a set of tokens for the given row;
counting a respective number of occurrences of each one of the tokens; and
storing the counted numbers of occurrences.
Dependent claims: 2, 3, 4, 5.
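As a rough illustration of the steps recited in claim 1, the following Python sketch recognizes a type for each field, builds the rows of a summary table (field name, type, count of instances with a value), and counts token occurrences for text fields. It is a minimal sketch only: the helper names (`infer_type`, `tokenize`, `summarize`), the word-count heuristic for telling text from categorical values, and the sample records are all assumptions made for illustration and are not taken from the patent.

```python
import re
from collections import Counter

def is_number(value):
    """Return True if the value parses as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def infer_type(values):
    """Select one of 'numeric', 'categorical', 'text'.

    Hypothetical heuristic: all values numeric -> numeric; any value
    longer than three words -> text; anything else -> categorical.
    """
    if values and all(is_number(v) for v in values):
        return "numeric"
    if any(len(str(v).split()) > 3 for v in values):
        return "text"
    return "categorical"

def tokenize(text):
    """Assumed tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(instances):
    """Build one summary row per field, plus token counts for text fields."""
    field_names = []
    for inst in instances:
        for name in inst:
            if name not in field_names:
                field_names.append(name)

    rows, token_counts = [], {}
    for name in field_names:
        # Only instances that actually carry a value are counted.
        values = [inst[name] for inst in instances
                  if inst.get(name) not in (None, "")]
        field_type = infer_type(values)
        rows.append({"name": name, "type": field_type, "count": len(values)})
        if field_type == "text":
            counts = Counter()
            for value in values:
                counts.update(tokenize(str(value)))
            token_counts[name] = counts  # stored occurrence counts
    return rows, token_counts

# Made-up sample data instances; the second 'age' value is missing.
samples = [
    {"age": "34", "plan": "basic", "note": "The app crashed after the update"},
    {"age": "", "plan": "premium", "note": "Love the new update, great app"},
    {"age": "51", "plan": "basic", "note": "App will not load after update"},
]

rows, token_counts = summarize(samples)
for row in rows:
    print(f"{row['name']:<6}{row['type']:<13}{row['count']}")
```

The `count` column reflects only instances that have a value for the field, which is why a field with missing entries reports fewer instances than the others.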
6. A method comprising:
accessing a digital source data file comprising a plurality of records, each record comprising at least one data field;
processing the source data file on a computing device to recognize a data type for each of the data fields;
in response to recognizing a text data type for a particular data field, matching the text of the particular data field to a human language and applying a stemming process corresponding to the matched human language, thereby tokenizing the text to form a corresponding token;
forming a dataset based on the source data file, said forming step including substituting the corresponding token into the dataset in place of each of the tokenized text fields;
displaying an interactive summary of the dataset on a display screen of the computing device;
building a model based at least in part on the dataset;
receiving an indication of a type of visualization to be displayed;
generating a space-filling graphical representation of the model on a computing device, the space-filling graphical representation comprising a plurality of segments arranged to realize the indicated type of visualization; and
displaying the space-filling graphical representation of the model on a display screen of the computing device; and
further displaying a legend adjacent to the space-filling representation of the model.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14.
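The language-matching and stemming steps of claim 6 could look something like the Python sketch below. Everything in it is an assumption for illustration: the stopword-overlap language matcher, the tiny stopword and suffix tables, and the toy suffix-stripping stemmer (which is not Porter or Snowball stemming) are hypothetical stand-ins, since the claim does not specify how the language is matched or which stemming process is used.

```python
import re

# Assumed, deliberately tiny stopword lists used only to guess the language.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "after"},
    "spanish": {"el", "la", "y", "es", "de", "que"},
}

# Assumed per-language suffix lists for a toy stemmer.
SUFFIXES = {
    "english": ("ing", "ed", "es", "s"),
    "spanish": ("ando", "iendo", "es", "s"),
}

def words(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def match_language(texts):
    """Pick the language whose stopwords appear most often in the texts."""
    tokens = [w for text in texts for w in words(text)]
    scores = {lang: sum(w in stops for w in tokens)
              for lang, stops in STOPWORDS.items()}
    return max(scores, key=scores.get)

def stem(word, language):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES[language]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize_field(records, field):
    """Return the matched language and a dataset with tokens substituted
    in place of the original text of the given field."""
    language = match_language([r[field] for r in records])
    dataset = [{**r, field: [stem(w, language) for w in words(r[field])]}
               for r in records]
    return language, dataset

# Made-up source records.
records = [
    {"id": 1, "comment": "The app crashed after updating"},
    {"id": 2, "comment": "Updates is the cause of crashes"},
]

language, dataset = tokenize_field(records, "comment")
print(language)
for row in dataset:
    print(row)
```

Because the stemmer is chosen after the language is matched, "crashed" and "crashes" reduce to the same token in this sketch, which is the property a model built on the tokenized dataset would rely on.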
15. A visualization method comprising:
accessing a data model based at least in part on a dataset comprising data items, wherein at least one of the data items includes a text data field;
in response to recognizing a text data type for one of the data fields of a sample data instance, matching the values of the text data field to a human language;
based on the matched human language, tokenizing a value of each text data field to form a corresponding token;
incorporating the corresponding token as a new value for the corresponding text data field in the dataset; and
displaying parameters of the text data field in a corresponding row of a visual summary of the dataset;
generating a decision tree representation of the data model, wherein the decision tree comprises nodes and branches, wherein at least one of the nodes represents a split based on the content of a text field;
displaying at least a selected portion of the decision tree on an electronic display screen;
highlighting a selected prediction path in the displayed portion of the decision tree; and
displaying a legend along with the displayed portion of the decision tree, the legend indicating each split criteria along the selected prediction path;
wherein, for each text field that appears in the legend, a pop-up panel is provisioned to display additional information, responsive to selection of a given text field in the legend, the additional information including a token used to determine the corresponding split in the prediction path.
Dependent claims: 16, 17, 18.
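One way to picture the prediction path and legend of claim 15 is the Python sketch below. It walks a small hand-built decision tree in which one node splits on whether a text field contains a token, and it collects a legend entry for each split along the path; text splits carry the token that a pop-up panel could show. The `Node` class, the `prediction_path` function, the field names, and the tree itself are hypothetical and serve only to illustrate the data involved, not the display logic.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    prediction: str                  # prediction reported at this node
    field: Optional[str] = None      # field tested here; None for a leaf
    kind: Optional[str] = None       # "text" or "numeric"
    test: object = None              # token (text) or threshold (numeric)
    yes: Optional["Node"] = None     # branch taken when the test passes
    no: Optional["Node"] = None      # branch taken when the test fails

def prediction_path(root, instance):
    """Follow the tree for one instance and build the legend entries."""
    legend = []
    node = root
    while node.field is not None:
        if node.kind == "text":
            # Text split: does the tokenized field contain the token?
            passed = node.test in instance[node.field]
            verb = "contains" if passed else "does not contain"
            entry = {
                "field": node.field,
                "criterion": f"{node.field} {verb} '{node.test}'",
                "token": node.test,  # extra detail for a pop-up panel
            }
        else:
            # Numeric split: compare against the threshold.
            passed = instance[node.field] <= node.test
            op = "<=" if passed else ">"
            entry = {
                "field": node.field,
                "criterion": f"{node.field} {op} {node.test}",
            }
        legend.append(entry)
        node = node.yes if passed else node.no
    return node.prediction, legend

# Hand-built example tree (made up for illustration).
tree = Node(
    prediction="stay",
    field="comment", kind="text", test="crash",
    yes=Node(
        prediction="churn",
        field="age", kind="numeric", test=40,
        yes=Node(prediction="churn"),
        no=Node(prediction="stay"),
    ),
    no=Node(prediction="stay"),
)

instance = {"comment": ["the", "app", "crash"], "age": 34}
prediction, legend = prediction_path(tree, instance)
print(prediction)
for entry in legend:
    print(entry["criterion"])
```

Keeping the token on the legend entry, rather than only in the rendered criterion string, lets a user interface look it up directly when a text field in the legend is selected.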
Specification