Systems and methods for ingesting and parsing datasets generated from disparate data sources

US 10,444,945 B1
Filed: 10/04/2017
Issued: 10/15/2019
Est. Priority Date: 10/10/2016
Status: Active Grant

Abstract

Disclosed herein are systems and methods capable of performing text exploration on large volume of corpus without prior knowledge in an accurate and efficient manner and may also provide any number of additional or alternative benefits and advantages. In particular, embodiments described herein provide a text exploration executable environment that uses unsupervised machine-learning to assist a human analyst with distilling key emerging themes from a corpus of hundreds or thousands of text files presented in a time series graphical user interface (GUI). A document may be a unit of text under analysis received from a particular data source, such as word-processing documents, paragraphs, sentences, chat sessions, speech-to-text call segments, online texts, social media postings (e.g., Tweets®), and other machine-readable text. In operation, a human analyst may use a text exploration software tool to identify the themes and stories within the corpus, by using integrated, synchronized GUIs that are dynamically generated by the software exploration tool.

18 Claims

1. A computer-implemented method comprising:
- receiving, by a computer, a plurality of text files from a plurality of data sources, each text file associated with a respective contact event via a respective data source, wherein the contact event corresponds to an electronic telecommunication session;
  
  for each text file in the plurality of text files, removing, by the computer, a set of words satisfying a stop word list;
  
  generating, by the computer, a vocabulary file for each text file from the plurality of text files containing a set of words extracted from the plurality of text files, wherein the set of words extracted from the plurality of text files are extracted, by the computer, based on a frequency of occurrence associated with each word satisfying a threshold value;
  
  generating, by the computer, a vector for each text file in the plurality of text files based upon the set of words extracted from each respective text file, wherein a value corresponding to each dimension of the vector is determined by a frequency of occurrence associated with each word in the set of words;
  
  generating, by the computer, a matrix corresponding to the generated vectors;
  
  determining, by the computer, a set of topics for the plurality of text files by decomposing the matrix using a non-negative matrix factorization algorithm;
  
  determining, by the computer, a distance value for each text file in the plurality of text files relative to other text files in the plurality of text files, wherein the distance value between two text files is determined based upon a similarity between two vectors corresponding to the two text files;
  
  generating, by the computer, a graphical user interface displaying a plurality of images representing each respective contact event based upon the distance value determined for each respective text file of each respective contact event;
  
  displaying, by the computer, the graphical user interface on a user device operated by a user; and
  
  in response to receiving from the user device a selection of a subset of the images representing contact events, generating, by the computer, a second graphical user interface containing a plurality of data fields associated with each of text file associated with the contact events of the selection, wherein at least one data field contains one or more extracts of a portion of each text file and the corresponding topic from the set of topics, and wherein the user selects the subset of the images by interacting with the graphical user interface displayed on the user device.
- - 2. The method of claim 1, further comprising:
    - determining, by the computer, a term-topic matrix comprising weight values of strength of association between each word in the vocabulary file and each topic in the set of topics.
  - 3. The method of claim 1, further comprising:
    - determining, by the computer, a topic-document matrix comprising a weight value of strength of association between each topic in the set of topics and each text file in the plurality of text files.
  - 4. The method of claim 3, further comprising:
    - determining, by the computer, a primary topic for each text file based on the weight values in the topic-document matrix, wherein the primary topic is corresponding to the topic with a highest weight value for the text file.
  - 5. The method of claim 4, further comprising:
    - determining, by the computer, trend of the primary topic over a period of time.
  - 6. The method of claim 1, further comprising:
    - in response to receiving a selection of a subset of the images representing contact events, generating, by the computer, a new graphical user interface displaying topics and the words in the vocabulary file that are most relevant to the topics associated with each of the text files associated with the contact events of the selection.
  - 7. The method of claim 1, further comprising:
    - in response to receiving a selection of a subset of the images representing contact events, generating, by the computer, a new graphical user interface displaying primary topics and trend of the primary topics associated with each of the text files associated with the contact events of the selection.
  - 8. The method of claim 1, wherein the plurality of text files comprise transcriptions of telephone calls, online chat sessions, emails, and texts from surveys and social media networks.
  - 9. The method of claim 1, wherein the extracts of a portion of the text file is filtered by topic number, topic term, document author, and date.
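
The text-processing steps recited in claim 1 (stop word removal, vocabulary extraction by frequency threshold, per-file frequency vectors, the resulting matrix, and a pairwise distance) can be illustrated with a minimal Python sketch. This is not the patented implementation. The stop word list, the threshold value, the sample texts, and all function names are hypothetical. The sketch also assumes a single corpus-wide vocabulary and uses cosine distance as one possible similarity-based distance; the claims do not name a specific measure.

```python
import math
from collections import Counter

# Hypothetical stop word list and threshold; the claims do not fix either.
STOP_WORDS = {"the", "a", "an", "is", "to", "my", "i", "and", "of", "at", "was", "not"}
MIN_COUNT = 2  # assumed frequency threshold for keeping a word in the vocabulary


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, and drop stop words."""
    words = (w.strip(".,!?;:").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_vocabulary(token_lists, min_count=MIN_COUNT):
    """Keep words whose corpus-wide count meets the threshold, in sorted order."""
    totals = Counter(w for tokens in token_lists for w in tokens)
    return sorted(w for w, n in totals.items() if n >= min_count)


def vectorize(tokens, vocabulary):
    """One dimension per vocabulary word; the value is the word's count in the file."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def document_term_matrix(texts):
    """Return the vocabulary and a matrix with one frequency vector (row) per text."""
    token_lists = [tokenize(t) for t in texts]
    vocabulary = build_vocabulary(token_lists)
    matrix = [vectorize(tokens, vocabulary) for tokens in token_lists]
    return vocabulary, matrix


def cosine_distance(u, v):
    """1 - cosine similarity; defined here as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical transcripts standing in for text files from contact events.
SAMPLE_TEXTS = [
    "My card was declined at the store.",
    "The card reader declined my card.",
    "I need to reset my password.",
    "Password reset is not working.",
]

vocabulary, matrix = document_term_matrix(SAMPLE_TEXTS)
distances = [[cosine_distance(u, v) for v in matrix] for u in matrix]
```

In this sketch the two card-related texts end up close to each other and far from the two password-related texts, which is the kind of distance a display could use to place the images representing contact events.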

10. A computer system comprising:
- a user device; and
  
  a computer in communication with the user device, wherein the computer is configured to:
  
  receive a plurality of text files from a plurality of data sources, each text file associated with a respective contact event via a respective data source, wherein the contact event corresponds to an electronic telecommunication session;
  
  for each text file in the plurality of text files, remove a set of words satisfying a stop word list;
  
  generate a vocabulary file for each text file from the plurality of text files containing a set of words extracted from the plurality of text files, wherein the set of words extracted from the plurality of text files are extracted, by the computer, based on a frequency of occurrence associated with each word satisfying a threshold value;
  
  generate a vector for each text file in the plurality of text files based upon the set of words extracted from each respective text file, wherein a value corresponding to each dimension of the vector is determined by a frequency of occurrence associated with each word in the set of words;
  
  generate a matrix corresponding to the generated vectors;
  
  determine a set of topics for the plurality of text files by decomposing the matrix using a non-negative matrix factorization algorithm;
  
  determine a distance value for each text file in the plurality of text files relative to other text files in the plurality of text files, wherein the distance value between two text files is determined based upon a similarity between two vectors corresponding to the two text files;
  
  generate a graphical user interface displaying a plurality of images representing each respective contact event based upon the distance value determined for each respective text file of each respective contact event;
  
  display the graphical user interface on the user device operated by a user; and
  
  in response to receiving from the user device a selection of a subset of the images representing contact events, generate a second graphical user interface containing a plurality of data fields associated with each of text file associated with the contact events of the selection, wherein at least one data field contains one or more extracts of a portion of each text file and the corresponding topic from the set of topics, and wherein the user selects the subset of the images by interacting with the graphical user interface displayed on the user device.
- - 11. The computer system of claim 10, wherein the computer is further configured to:
    - determine a term-topic matrix comprising weight values of strength of association between each word in the vocabulary file and each topic in the set of topics.
  - 12. The computer system of claim 10, wherein the computer is further configured to:
    - determine a topic-document matrix comprising a weight value of strength of association between each topic in the set of topics and each text file in the plurality of text files.
  - 13. The computer system of claim 12, wherein the computer is further configured to:
    - determine a primary topic for each text file based on the weight value in the topic-document matrix, wherein the primary topic is corresponding to the topic with a highest weight value for the text file.
  - 14. The computer system of claim 13, wherein the computer is further configured to:
    - determine trend of the primary topic over a period of time.
  - 15. The computer system of claim 10, wherein the computer is further configured to:
    - in response to receiving a selection of a subset of the images representing contact events, generate a new graphical user interface displaying topics and the words in the vocabulary file that are most relevant to the topics associated with each of the text files associated with the contact events of the selection.
  - 16. The computer system of claim 10, wherein the computer is further configured to:
    - in response to receiving a selection of a subset of the images representing contact events, generate a new graphical user interface displaying primary topics and trend of the primary topics associated with each of the text files associated with the contact events of the selection.
  - 17. The computer system of claim 10, wherein the plurality of text files comprise transcriptions of telephone calls, online chat sessions, emails, and texts from surveys and social media networks.
  - 18. The computer system of claim 10, wherein the extracts of a portion of the text file is filtered by topic number, topic term, document author, and date.
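
The topic-related limitations (non-negative matrix factorization in claims 1 and 10, the term-topic and topic-document matrices in claims 2, 3, 11, and 12, the primary topic in claims 4 and 13, and the trend in claims 5 and 14) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation. It assumes a term-by-document input matrix and uses the well-known multiplicative update rule for the Frobenius-norm objective, whereas the claims do not specify a particular factorization algorithm. The trend is reduced to a per-date count. All function names, parameters, and the sample matrix are hypothetical.

```python
import random
from collections import Counter


def matmul(a, b):
    """Plain-Python product of two row-major matrices (lists of lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v (terms x documents) into w (terms x k) and h (k x documents).

    w plays the role of a term-topic matrix and h of a topic-document matrix.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n_terms)]
    h = [[rng.random() + eps for _ in range(n_docs)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_docs)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_terms)]
    return w, h


def primary_topics(h):
    """For each document (column of h), the index of the highest-weight topic."""
    return [max(range(len(h)), key=lambda t: h[t][d]) for d in range(len(h[0]))]


def topic_trend(primary, dates):
    """Count documents per date for each primary topic: {topic: {date: count}}."""
    trend = {}
    for topic, date in zip(primary, dates):
        trend.setdefault(topic, Counter())[date] += 1
    return {topic: dict(counts) for topic, counts in trend.items()}


# Hypothetical term-by-document counts: two terms used only by documents 0 and 1,
# two terms used only by documents 2 and 3.
SAMPLE_MATRIX = [
    [2, 4, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]

term_topic, topic_document = nmf(SAMPLE_MATRIX, k=2)
primary = primary_topics(topic_document)
```

Because the factorization starts from random values, the numbering of the topics is arbitrary; only the grouping of documents by primary topic is meaningful.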

Current Assignee: United Services Automobile Association
Original Assignee: United Services Automobile Association
Inventors: Fehlman, II, William Leland
Primary Examiner(s): Xiao, Di
Application Number: US15/725,094
Time in Patent Office: 741 Days
CPC Class Codes

G06F 16/34   Browsing; Visualisation the...

G06F 16/353   into predefined classes

G06F 17/16   Matrix or vector computatio...

G06F 3/0482   Interaction with lists of s...

G06F 40/123   Storage facilities

G06F 40/216   using statistical methods

G06F 40/279   Recognition of textual enti...

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/30   Semantic analysis

G06N 20/00   Machine learning
