Systems and methods for ingesting and parsing datasets generated from disparate data sources
First Claim
1. A computer-implemented method comprising:
receiving, by a computer, a plurality of text files from a plurality of data sources, each text file associated with a respective contact event via a respective data source, wherein the contact event corresponds to an electronic telecommunication session;
for each text file in the plurality of text files, removing, by the computer, a set of words satisfying a stop word list;
generating, by the computer, a vocabulary file for each text file from the plurality of text files containing a set of words extracted from the plurality of text files, wherein the set of words extracted from the plurality of text files are extracted, by the computer, based on a frequency of occurrence associated with each word satisfying a threshold value;
generating, by the computer, a vector for each text file in the plurality of text files based upon the set of words extracted from each respective text file, wherein a value corresponding to each dimension of the vector is determined by a frequency of occurrence associated with each word in the set of words;
generating, by the computer, a matrix corresponding to the generated vectors;
determining, by the computer, a set of topics for the plurality of text files by decomposing the matrix using a non-negative matrix factorization algorithm;
determining, by the computer, a distance value for each text file in the plurality of text files relative to other text files in the plurality of text files, wherein the distance value between two text files is determined based upon a similarity between two vectors corresponding to the two text files;
generating, by the computer, a graphical user interface displaying a plurality of images representing each respective contact event based upon the distance value determined for each respective text file of each respective contact event;
displaying, by the computer, the graphical user interface on a user device operated by a user; and
in response to receiving from the user device a selection of a subset of the images representing contact events, generating, by the computer, a second graphical user interface containing a plurality of data fields associated with each of text file associated with the contact events of the selection, wherein at least one data field contains one or more extracts of a portion of each text file and the corresponding topic from the set of topics, and wherein the user selects the subset of the images by interacting with the graphical user interface displayed on the user device.
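The preprocessing steps recited in the claim (stop word removal, frequency-thresholded vocabulary, per-file frequency vectors, and the resulting matrix) can be illustrated with a minimal sketch. The stop word list, the threshold value, the sample transcripts, and all function names below are hypothetical; the claim does not specify them. The sketch also assumes one vocabulary shared across the corpus, which is only one possible reading of the claim language.

```python
import re
from collections import Counter

# Hypothetical stop word list and threshold; the claim fixes neither.
STOP_WORDS = {"the", "a", "is", "my", "to", "i", "and", "on"}
MIN_COUNT = 2  # assumed frequency threshold for keeping a word


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_vocabulary(token_lists, min_count=MIN_COUNT):
    """Keep words whose corpus-wide count meets the threshold (sorted for stable columns)."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)


def vectorize(tokens, vocabulary):
    """One dimension per vocabulary word; the value is that word's count in the file."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def build_matrix(texts):
    """Return the vocabulary and a matrix with one frequency vector (row) per text file."""
    token_lists = [tokenize(t) for t in texts]
    vocabulary = build_vocabulary(token_lists)
    matrix = [vectorize(tokens, vocabulary) for tokens in token_lists]
    return vocabulary, matrix


# Made-up transcripts standing in for text files tied to contact events.
transcripts = [
    "My card payment failed and the card is blocked",
    "I need to reset my password",
    "Password reset link failed",
]
vocabulary, matrix = build_matrix(transcripts)
print(vocabulary)
for row in matrix:
    print(row)
```

Sorting the vocabulary keeps the column order deterministic, so every row of the matrix lines up with the same words.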
Abstract
Disclosed herein are systems and methods that perform accurate and efficient text exploration on a large corpus without prior knowledge, and that may also provide any number of additional or alternative benefits and advantages. In particular, embodiments described herein provide a text exploration executable environment that uses unsupervised machine learning to help a human analyst distill key emerging themes from a corpus of hundreds or thousands of text files presented in a time-series graphical user interface (GUI). A document may be a unit of text under analysis received from a particular data source, such as a word-processing document, paragraph, sentence, chat session, speech-to-text call segment, online text, social media posting (e.g., Tweets®), or other machine-readable text. In operation, a human analyst may use a text exploration software tool to identify the themes and stories within the corpus by using integrated, synchronized GUIs that the tool generates dynamically.
18 Claims
1. A computer-implemented method comprising:
receiving, by a computer, a plurality of text files from a plurality of data sources, each text file associated with a respective contact event via a respective data source, wherein the contact event corresponds to an electronic telecommunication session;
for each text file in the plurality of text files, removing, by the computer, a set of words satisfying a stop word list;
generating, by the computer, a vocabulary file for each text file from the plurality of text files containing a set of words extracted from the plurality of text files, wherein the set of words extracted from the plurality of text files are extracted, by the computer, based on a frequency of occurrence associated with each word satisfying a threshold value;
generating, by the computer, a vector for each text file in the plurality of text files based upon the set of words extracted from each respective text file, wherein a value corresponding to each dimension of the vector is determined by a frequency of occurrence associated with each word in the set of words;
generating, by the computer, a matrix corresponding to the generated vectors;
determining, by the computer, a set of topics for the plurality of text files by decomposing the matrix using a non-negative matrix factorization algorithm;
determining, by the computer, a distance value for each text file in the plurality of text files relative to other text files in the plurality of text files, wherein the distance value between two text files is determined based upon a similarity between two vectors corresponding to the two text files;
generating, by the computer, a graphical user interface displaying a plurality of images representing each respective contact event based upon the distance value determined for each respective text file of each respective contact event;
displaying, by the computer, the graphical user interface on a user device operated by a user; and
in response to receiving from the user device a selection of a subset of the images representing contact events, generating, by the computer, a second graphical user interface containing a plurality of data fields associated with each of text file associated with the contact events of the selection, wherein at least one data field contains one or more extracts of a portion of each text file and the corresponding topic from the set of topics, and wherein the user selects the subset of the images by interacting with the graphical user interface displayed on the user device.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
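The topic-determination step names only "a non-negative matrix factorization algorithm" and does not say which one. As an illustration, the sketch below factors a small document-by-word matrix V into non-negative factors W (documents by topics) and H (topics by words) using the well-known Lee and Seung multiplicative update rules. The choice of update rule, the number of topics, the iteration count, the sample matrix, and every name in the code are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x words) into non-negative W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_words(h, vocabulary, n=2):
    """For each topic (row of H), list the n highest-weighted vocabulary words."""
    return [[vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
            for row in h]


# Made-up frequency matrix: one row per text file, one column per vocabulary word.
VOCAB = ["card", "failed", "password", "reset"]
V = [[2, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1], [3, 1, 0, 0]]
W, H = nmf(V, n_topics=2)
print(top_words(H, VOCAB))
# The largest entry in each row of W suggests that file's dominant topic.
print([max(range(len(row)), key=lambda k: row[k]) for row in W])
```

Because the updates only multiply non-negative quantities, W and H stay non-negative throughout, which is what lets each row of H be read as a weighted word list for one topic.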
10. A computer system comprising:
a user device; and
a computer in communication with the user device, wherein the computer is configured to;
receive a plurality of text files from a plurality of data sources, each text file associated with a respective contact event via a respective data source, wherein the contact event corresponds to an electronic telecommunication session;
for each text file in the plurality of text files, remove a set of words satisfying a stop word list;
generate a vocabulary file for each text file from the plurality of text files containing a set of words extracted from the plurality of text files, wherein the set of words extracted from the plurality of text files are extracted, by the computer, based on a frequency of occurrence associated with each word satisfying a threshold value;
generate a vector for each text file in the plurality of text files based upon the set of words extracted from each respective text file, wherein a value corresponding to each dimension of the vector is determined by a frequency of occurrence associated with each word in the set of words;
generate a matrix corresponding to the generated vectors;
determine a set of topics for the plurality of text files by decomposing the matrix using a non-negative matrix factorization algorithm;
determine a distance value for each text file in the plurality of text files relative to other text files in the plurality of text files, wherein the distance value between two text files is determined based upon a similarity between two vectors corresponding to the two text files;
generate a graphical user interface displaying a plurality of images representing each respective contact event based upon the distance value determined for each respective text file of each respective contact event;
display the graphical user interface on the user device operated by a user; and
in response to receiving from the user device a selection of a subset of the images representing contact events, generate a second graphical user interface containing a plurality of data fields associated with each of text file associated with the contact events of the selection, wherein at least one data field contains one or more extracts of a portion of each text file and the corresponding topic from the set of topics, and wherein the user selects the subset of the images by interacting with the graphical user interface displayed on the user device.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
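The distance and selection steps can be sketched in the same spirit. The claim says only that the distance between two text files is based upon a similarity between their vectors; the example below assumes cosine similarity and defines distance as one minus that similarity. The `ContactEvent` record, the `detail_rows` helper, the topic labels, and the 40-character extract length are all hypothetical stand-ins for whatever data fields the second graphical user interface would actually show.

```python
import math
from dataclasses import dataclass


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (assumed convention)."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_table(vectors):
    """Pairwise distances between every pair of text-file vectors."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


@dataclass
class ContactEvent:
    """Hypothetical record tying a text file to its contact event and assigned topic."""
    event_id: str
    text: str
    topic: str


def detail_rows(events, selected_ids, extract_chars=40):
    """Build one row of data fields per selected contact event: id, topic, and a text extract."""
    chosen = set(selected_ids)
    return [
        {"event_id": e.event_id, "topic": e.topic, "extract": e.text[:extract_chars]}
        for e in events
        if e.event_id in chosen
    ]


# Made-up frequency vectors and events; topic labels are illustrative only.
vectors = [[2, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
distances = distance_table(vectors)

events = [
    ContactEvent("c1", "My card payment failed and the card is blocked", "card"),
    ContactEvent("c2", "I need to reset my password", "password"),
    ContactEvent("c3", "Password reset link failed", "password"),
]
rows = detail_rows(events, selected_ids=["c2", "c3"])
for row in rows:
    print(row)
```

A front end could use the distance table to position the images for each contact event, then call something like `detail_rows` with the identifiers of the images the user selected.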
Specification