Data analytics system and methods for text data

US 10,275,444 B2
Filed: 07/15/2016
Issued: 04/30/2019
Est. Priority Date: 07/15/2016
Status: Active Grant

Abstract

Aspects of the subject disclosure may include, for example, a computer that performs a statistical natural language processing analysis on a plurality of text documents to determine a plurality of topics, creates a proper subset of topics from the plurality of topics, based on user input, maps one or more topics in the proper subset of topics to each document in the plurality of text documents, thereby creating a plurality of topic-document pairs, identifies n-dimensions of bias for each topic-document pair from the text, creates clusters of topics from the proper subset of topics, and generates presentable content depicting each cluster of the clusters of topics according to a corresponding image configuration. The topics and n-dimensions of bias data can be further analyzed with co-collected structured data for statistical relationships. The topics and n-dimensions of bias data can be used for a publisher-subscriber network that uses content-driven routing when delivering raw data and summarized data via the network. Other embodiments are disclosed.
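
The content-driven routing mentioned in the abstract can be illustrated with a small publish-subscribe sketch. This is only an illustration under stated assumptions: the `Message` schema, the `ContentRouter` class, the subscriber names, and the numeric bias scale are all hypothetical and do not come from the patent. The point it shows is that delivery is decided by the topic and bias carried in each message rather than by a fixed channel name.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Message:
    """A document summary tagged with a topic and a bias score (hypothetical schema)."""
    doc_id: str
    topic: str
    bias: float  # assumed scale: negative values < 0 < positive values


class ContentRouter:
    """Delivers each message to every subscriber whose predicate matches its content."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[str, Callable[[Message], bool]]] = []
        self.inboxes: Dict[str, List[Message]] = defaultdict(list)

    def subscribe(self, name: str, predicate: Callable[[Message], bool]) -> None:
        self._subscriptions.append((name, predicate))

    def publish(self, message: Message) -> None:
        # Routing is driven by message content (topic, bias), not by a channel name.
        for name, predicate in self._subscriptions:
            if predicate(message):
                self.inboxes[name].append(message)


router = ContentRouter()
router.subscribe("billing_team", lambda m: m.topic == "billing")
router.subscribe("escalations", lambda m: m.bias < -0.5)

router.publish(Message("d1", "billing", -0.8))   # matches both subscribers
router.publish(Message("d2", "coverage", 0.4))   # matches neither
```

A subscriber here is just a predicate over message content, so adding a new routing rule does not require changing the publisher.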

20 Claims

1. A device, comprising:
- a processing system including a processor; and
  
  a memory that stores executable instructions that, when executed by the processing system, facilitate performance of operations, comprising:
  
  performing a statistical natural language processing analysis on a plurality of text documents to determine a plurality of topics, wherein prior to performing the statistical natural language processing analysis, a training is performed on sample documents to determine parameters for the statistical natural language processing analysis;
  
  creating a proper subset of topics from the plurality of topics, based on user input;
  
  mapping a topic in the proper subset of topics to each document in the plurality of text documents, thereby creating a plurality of topic-document pairs;
  
  for each topic-document pair of the plurality of topic-document pairs, identifying a bias from text in a corresponding document of the topic-document pair;
  
  creating clusters of topics from the proper subset of topics, wherein each cluster of topics is determined from the bias of each topic-document pair and a frequency of occurrence of each topic in the document identified by the topic-document pair, and wherein the clusters of topics have an image configuration based on the bias and the frequency of occurrence that distinguishes one cluster from another; and
  
  generating presentable content depicting each cluster of the clusters of topics according to a corresponding image configuration, wherein the image configuration specifies that an area for each cluster of topics is subdivided into separate sub-areas for each topic, wherein the sub-area for each topic represents a frequency of occurrence of that topic.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 2. The device of claim 1, wherein the image configuration comprises size, shape, color coding, or any combination thereof.
  - 3. The device of claim 2, wherein the presentable content includes a Pareto analysis of bias associated with each topic in each cluster of the clusters of topics.
  - 4. The device of claim 3, wherein the image configuration comprises an area for each cluster in the clusters of topics.
  - 5. The device of claim 4, wherein a size of the area for each cluster in the clusters of topics represents the frequency of occurrence of each topic in the clusters of topics.
  - 6. The device of claim 5, wherein the area for each cluster in the clusters of topics is subdivided into separate areas for each topic in a cluster in the clusters of topics, wherein a separate area for a topic represents the frequency of occurrence of the topic in the cluster.
  - 7. The device of claim 6, wherein the separate area for the topic further comprises a color that represents the bias of the topic.
  - 8. The device of claim 7, wherein identifying the bias of each topic-document pair comprises a latent semantic analysis.
  - 9. The device of claim 8, wherein identifying the bias comprises one of positive bias, neutral, or negative bias.
  - 10. The device of claim 9, wherein the statistical natural language processing analysis comprises a latent Dirichlet allocation.
  - 11. The device of claim 10, wherein the processing system includes a plurality of processors operating in a distributed processing environment.
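
The image configuration described in claims 1 and 4 through 7 (an area per cluster, subdivided into per-topic sub-areas sized by frequency and colored by bias) can be sketched as a simple slice layout. The claims do not name a layout algorithm, so the slicing scheme, the canvas size, the color names, the bias threshold, and all function names below are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def bias_color(bias: float, threshold: float = 0.1) -> str:
    """Map a bias score to a color; the threshold and colors are illustrative."""
    if bias > threshold:
        return "green"   # positive bias
    if bias < -threshold:
        return "red"     # negative bias
    return "gray"        # neutral


def slice_rect(rect: Rect, weights: List[float], vertical: bool) -> List[Rect]:
    """Split rect into strips whose areas are proportional to weights.

    Assumes every weight is positive.
    """
    x, y, w, h = rect
    total = sum(weights)
    strips, offset = [], 0.0
    for weight in weights:
        share = weight / total
        if vertical:  # strips placed side by side
            strips.append((x + offset * w, y, share * w, h))
        else:         # strips stacked top to bottom
            strips.append((x, y + offset * h, w, share * h))
        offset += share
    return strips


def layout(clusters: Dict[str, Dict[str, Tuple[int, float]]],
           canvas: Rect = (0.0, 0.0, 100.0, 100.0)) -> List[dict]:
    """clusters maps cluster name -> topic -> (frequency, bias)."""
    names = list(clusters)
    totals = [sum(freq for freq, _ in clusters[n].values()) for n in names]
    cells = []
    # Each cluster gets an area proportional to its total topic frequency.
    for name, cluster_rect in zip(names, slice_rect(canvas, totals, vertical=True)):
        topics = list(clusters[name].items())
        freqs = [freq for _, (freq, _) in topics]
        # Each topic gets a sub-area of the cluster proportional to its frequency.
        for (topic, (_, bias)), rect in zip(topics, slice_rect(cluster_rect, freqs, vertical=False)):
            cells.append({"cluster": name, "topic": topic,
                          "rect": rect, "color": bias_color(bias)})
    return cells


cells = layout({
    "service": {"billing": (30, -0.6), "support": (10, 0.0)},
    "network": {"coverage": (60, 0.4)},
})
```

Because both levels of slicing are proportional, each topic's rectangle area is proportional to its frequency across the whole canvas, and a cluster's area is the sum of its topics' areas.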

12. A non-transitory computer-readable storage medium, comprising executable instructions that, when executed by a processing system including a processor, facilitate performance of operations, comprising:
- performing training on a plurality of sample documents to determine parameters for further analysis in order to control a number of topics determined by the further analysis, the further analysis comprising:
  
  determining a plurality of topics from a plurality of text documents;
  
  mapping one or more topics in the plurality of topics to each document in the plurality of text documents;
  
  reducing the plurality of topics into a proper subset of topics based on a frequency of occurrence of each topic in the plurality of text documents;
  
  identifying n-dimensions of bias for each topic in the proper subset of topics, the n-dimensions of bias identified from text in a corresponding document mapped to the topic;
  
  creating clusters of topics from the proper subset of topics, wherein each cluster of topics in the clusters of topics is determined from a latent semantic analysis comprising singular value decomposition into orthogonal dimensions, wherein each cluster of topics has an image configuration based on the n-dimensions of bias and the frequency of occurrence for topics in the clusters of topics that distinguishes one cluster from another; and
  
  generating presentable content illustrating each cluster of the clusters of topics according to a corresponding image configuration.
- Dependent claims (13, 14, 15, 16, 17)
  - 13. The non-transitory computer-readable storage medium of claim 12, wherein the image configuration comprises a geometric area for a topic in a cluster of topics.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein a size of the geometric area for the topic in the cluster of topics represents a summary statistic in the plurality of text documents comprising the topic.
  - 15. The non-transitory computer-readable storage medium of claim 14, wherein a color of the geometric area for the topic in the cluster of topics represents the n-dimensions of bias of the topic from the plurality of text documents associated with the topic.
  - 16. The non-transitory computer-readable storage medium of claim 12, wherein the processing system includes a plurality of processors operating in a distributed processing environment.
  - 17. The non-transitory computer-readable storage medium of claim 12, wherein identifying the n-dimensions of bias comprises one of positive bias, neutral, or negative bias.
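
The reduction step in claim 12 (narrowing the topics to a proper subset based on frequency of occurrence) can be sketched as a document-frequency filter. The claim does not state how frequency is thresholded, so the `min_docs` cutoff, the function name, and the sample data are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set


def reduce_topics(doc_topics: Dict[str, List[str]], min_docs: int) -> Set[str]:
    """Keep topics that occur in at least min_docs documents.

    Raises ValueError if no topic would be dropped, because the result
    would then not be a proper subset of the original topics.
    """
    # Count each topic once per document, regardless of repeats within it.
    doc_frequency = Counter(t for topics in doc_topics.values() for t in set(topics))
    kept = {topic for topic, count in doc_frequency.items() if count >= min_docs}
    if kept == set(doc_frequency):
        raise ValueError("threshold keeps every topic; result is not a proper subset")
    return kept


doc_topics = {
    "d1": ["billing", "coverage"],
    "d2": ["billing"],
    "d3": ["coverage", "roaming"],
}
kept = reduce_topics(doc_topics, min_docs=2)  # "roaming" appears in one document only
```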

18. A method, comprising:
- performing, by a system comprising a processor, a latent Dirichlet allocation of a plurality of text documents to determine a plurality of topics, wherein the plurality of topics are determined according to parameters, wherein the parameters are determined by training on a sample of documents in order to control a number of the plurality of topics that are determined;
  
  creating, by the system, a proper subset of topics from the plurality of topics, based on user input;
  
  mapping, by the system, one or more topics in the proper subset of topics to each document in the plurality of text documents, thereby creating a plurality of topic-document pairs;
  
  performing, by the system, a latent semantic analysis of text in the document associated with each topic-document pair to determine n-dimensions of bias for each topic-document pair;
  
  creating, by the system, clusters of topics from the proper subset of topics, wherein each cluster of topics is determined from a value for each bias dimension of each topic-document pair and a frequency of occurrence of each topic in the plurality of text documents; and
  
  generating, by the system, presentable content that illustrates each cluster of the clusters of topics according to a corresponding image configuration, wherein the image configuration is based on all or a subset of the bias dimensions and the frequency of occurrence of topics in a cluster that distinguishes the cluster from other clusters.
- Dependent claims (19, 20)
  - 19. The method of claim 18, wherein the latent semantic analysis comprises singular value decomposition into orthogonal dimensions.
  - 20. The method of claim 18, wherein the user input comprises merging two topics.
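
Claim 20's user input (merging two topics) and the topic-document pairs of claim 18 can be sketched as a relabeling of per-document topic assignments. The data layout, the function name `merge_topics`, and the sample topics are hypothetical; the patent does not prescribe a representation.

```python
from typing import Dict, List, Tuple


def merge_topics(doc_topics: Dict[str, List[str]], source: str, target: str) -> Dict[str, List[str]]:
    """Return a new mapping in which every occurrence of source is relabeled as target."""
    merged = {}
    for doc_id, topics in doc_topics.items():
        relabeled = [target if t == source else t for t in topics]
        # dict.fromkeys drops duplicates while preserving first-seen order.
        merged[doc_id] = list(dict.fromkeys(relabeled))
    return merged


doc_topics = {"d1": ["invoice", "billing"], "d2": ["invoice"], "d3": ["coverage"]}
merged = merge_topics(doc_topics, source="invoice", target="billing")

# Topic-document pairs built from the merged assignments.
pairs: List[Tuple[str, str]] = [
    (topic, doc_id) for doc_id, topics in merged.items() for topic in topics
]
```

After the merge the topic set shrinks from three topics to two, which is one way user input can produce a proper subset.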

Current Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors
Bogdan, Pamela; Gressel, Gary; Reser, Gary; Rubarkh, Alex; Shirley, Kenneth
Primary Examiner(s)
Guerra-Erazo, Edgar X

Application Number
US15/211,837
Publication Number
US 20180018316A1
Time in Patent Office
1,019 Days
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 16/355   Class or cluster creation o...

G06F 16/358   Browsing; Visualisation the...

G06F 40/216   using statistical methods

G06F 40/30   Semantic analysis
