Methods and systems for the analysis of large text corpora

US 9,135,242 B1
Filed: 03/15/2013
Issued: 09/15/2015
Est. Priority Date: 10/10/2011
Status: Active Grant

Abstract

Computerized methods and systems for the analysis of textual data, including: receiving, from one or more memories at one or more processors, textual data; using the processors, formatting the textual data for analysis and applying a probabilistic topic model to the textual data to extract semantically meaningful topics that collectively describe it; using a keyword weighting module, generating a topic cloud view representing the topics as a tagcloud with each being associated with a plurality of keywords; using a topic ordering module, generating a document distribution view representing a distribution of the textual data across multiple topics; using a document entropy calculation module, generating a document scatterplot view representing how many topics are attributable to the textual data; using a temporal topic trend calculation module, generating a temporal view representing changes in the occurrence of topics over time; and displaying one or more of the views to a user.
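
The temporal view described above can be illustrated with a minimal sketch. It assumes each document already carries a date and a topic mixture produced by the topic model, and it averages the mixtures per calendar month. The function name `topic_trend`, the input format, and the monthly bucketing are assumptions made for illustration; the abstract does not say how the temporal topic trend calculation module computes its values.

```python
from collections import defaultdict
from datetime import date


def topic_trend(docs, num_topics):
    """Average topic proportions per (year, month) bucket.

    docs: iterable of (date, topic_mixture) pairs, where topic_mixture is a
    list of num_topics proportions (hypothetical input format).
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for day, mixture in docs:
        bucket = (day.year, day.month)
        for k, p in enumerate(mixture):
            sums[bucket][k] += p
        counts[bucket] += 1
    # Divide by the document count so busy months do not dominate the trend.
    return {
        bucket: [s / counts[bucket] for s in sums[bucket]]
        for bucket in sorted(sums)
    }


# Three toy documents over two topics (made-up values).
docs = [
    (date(2012, 1, 5), [0.5, 0.5]),
    (date(2012, 1, 20), [1.0, 0.0]),
    (date(2012, 2, 3), [0.25, 0.75]),
]
trend = topic_trend(docs, num_topics=2)
```

Plotting each topic's value across the sorted buckets would give one line per topic, which is one plausible reading of "changes in the occurrence of topics over time".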

20 Claims

1. A computerized method for the analysis of textual data, comprising:
- receiving, from one or more memories at one or more processors, textual data to be analyzed;
  
  using the one or more processors, formatting the textual data for subsequent analysis;
  
  using the one or more processors, applying a probabilistic topic model to the textual data to extract a set of semantically meaningful topics that collectively describe all or a portion of the textual data;
  
  using a keyword weighting module executed on the one or more processors, generating a topic cloud view representing the topics as a tagcloud with each being associated with a plurality of keywords;
  
  using a topic ordering module executed on the one or more processors, generating a document distribution view representing a distribution of all or a portion of the textual data across multiple topics;
  
  using a document entropy calculation module executed on the one or more processors, generating a document scatterplot view representing how many topics are attributable to all or a portion of the textual data;
  
  using a temporal topic trend calculation module executed on the one or more processors, generating a temporal view representing changes in the occurrence of topics over time in relation to all or a portion of the textual data; and
  
  displaying one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the temporal view to a user in the analysis of all or a portion of the textual data.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The computerized method of claim 1, wherein the textual data comprises one or more of textual data derived from a plurality of documents, textual data derived from a plurality of files, textual data derived from one or more data storage repositories, and textual data derived from the Internet.
  - 3. The computerized method of claim 1, wherein formatting the textual data for subsequent analysis comprises one or more of stopword removal, duplicated-content removal, part-of-speech analysis, n-gram analysis of sentences to extract segments, entity extraction analysis to extract named entities, sentiment analysis of the basic sentiment of documents or paragraphs, and temporal and spatial indicator extraction.
  - 4. The computerized method of claim 1, wherein the probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords.
  - 5. The computerized method of claim 4, wherein the textual data is described as a probabilistic mixture of topics.
  - 6. The computerized method of claim 1, wherein the probabilistic topic model comprises Latent Dirichlet Allocation (LDA).
  - 7. The computerized method of claim 1, wherein the keywords are ordered to indicate their importance to a given topic and relationship to one another.
  - 8. The computerized method of claim 1, wherein the keywords are highlighted to indicate their importance to multiple topics.
  - 9. The computerized method of claim 1, wherein topics are ordered to represent their relationships.
  - 10. The computerized method of claim 1, wherein the document entropy calculation module utilizes a Shannon entropy calculation.
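
Claims 5 and 10 can be read together: if a document is a probabilistic mixture of topics, the Shannon entropy of that mixture measures how many topics the document is spread across. The sketch below computes H = -Σ p_k log2(p_k) over a document's topic proportions. The function name `topic_entropy`, the base-2 logarithm, and the normalization step are assumptions for illustration; the claims name Shannon entropy but do not fix these details.

```python
import math


def topic_entropy(mixture):
    """Shannon entropy (in bits) of a document's topic mixture."""
    total = sum(mixture)
    if total <= 0:
        raise ValueError("mixture must contain positive mass")
    entropy = 0.0
    for weight in mixture:
        p = weight / total  # normalize so the weights sum to 1
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return entropy


# A document about a single topic versus one spread evenly over four topics.
focused = topic_entropy([1.0, 0.0, 0.0, 0.0])
mixed = topic_entropy([0.25, 0.25, 0.25, 0.25])
```

Under these assumptions a single-topic document scores 0 bits and an even four-topic mixture scores 2 bits, so a scatterplot axis based on this value would separate focused documents from broad ones.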

11. A computerized system for the analysis of textual data, comprising:
- one or more memories operable for storing and one or more processors operable for receiving textual data to be analyzed;
  
  an algorithm executed on the one or more processors operable for formatting the textual data for subsequent analysis;
  
  an algorithm executed on the one or more processors operable for applying a probabilistic topic model to the textual data to extract a set of semantically meaningful topics that collectively describe all or a portion of the textual data;
  
  a keyword weighting module executed on the one or more processors operable for generating a topic cloud view representing the topics as a tagcloud with each being associated with a plurality of keywords;
  
  a topic ordering module executed on the one or more processors operable for generating a document distribution view representing a distribution of all or a portion of the textual data across multiple topics;
  
  a document entropy calculation module executed on the one or more processors operable for generating a document scatterplot view representing how many topics are attributable to all or a portion of the textual data;
  
  a temporal topic trend calculation module executed on the one or more processors operable for generating a temporal view representing changes in the occurrence of topics over time in relation to all or a portion of the textual data; and
  
  a display operable for displaying one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the temporal view to a user in the analysis of all or a portion of the textual data.
- Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The computerized system of claim 11, wherein the textual data comprises one or more of textual data derived from a plurality of documents, textual data derived from a plurality of files, textual data derived from one or more data storage repositories, and textual data derived from the Internet.
  - 13. The computerized system of claim 11, wherein formatting the textual data for subsequent analysis comprises one or more of word binning, geo-spatial binning, temporal information binning, entity-level content binning, document similarity comparison, document probability distribution, entropy analysis, document segmentation, word frequency detection, data coordination, GUI design, direct visual manipulation, and data-visual-element transformation and correlation.
  - 14. The computerized system of claim 11, wherein the probabilistic topic model generates a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords.
  - 15. The computerized system of claim 14, wherein the textual data is described as a probabilistic mixture of topics.
  - 16. The computerized system of claim 11, wherein the probabilistic topic model comprises Latent Dirichlet Allocation (LDA).
  - 17. The computerized system of claim 11, wherein the keywords are ordered to indicate their importance to a given topic and relationship to one another.
  - 18. The computerized system of claim 11, wherein the keywords are highlighted to indicate their importance to multiple topics.
  - 19. The computerized system of claim 11, wherein topics are ordered to represent their relationships.
  - 20. The computerized system of claim 11, wherein the document entropy calculation module utilizes a Shannon entropy calculation.
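
Claims 14, 17, and 18 suggest one simple way to prepare data for a topic cloud: treat each topic as a distribution over keywords, order the keywords by probability within the topic, and flag keywords that rank highly in more than one topic. The sketch below does exactly that. The function name `topic_cloud`, the `top_k` cutoff, the use of raw probability as the weight, and the toy topics are all assumptions; the claims do not disclose how the keyword weighting module actually scores or highlights keywords.

```python
from collections import Counter


def topic_cloud(topic_word_probs, top_k=3):
    """Return, per topic, its top_k keywords ordered by probability, each
    flagged when it also ranks in the top_k of another topic."""
    top = {
        topic: sorted(probs, key=probs.get, reverse=True)[:top_k]
        for topic, probs in topic_word_probs.items()
    }
    # Count in how many topics each keyword made the top_k list.
    usage = Counter(word for words in top.values() for word in words)
    return {
        topic: [(w, topic_word_probs[topic][w], usage[w] > 1) for w in words]
        for topic, words in top.items()
    }


# Two toy topics with made-up keyword probabilities.
topics = {
    "t0": {"battery": 0.4, "screen": 0.3, "price": 0.2, "shipping": 0.1},
    "t1": {"shipping": 0.5, "price": 0.3, "refund": 0.15, "battery": 0.05},
}
cloud = topic_cloud(topics, top_k=3)
```

Each entry is a `(keyword, weight, shared)` tuple; a renderer could map the weight to font size and the `shared` flag to a highlight color.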

Current Assignee
Stratifyd Software LLC
Original Assignee
University of North Carolina At Charlotte (University of North Carolina System)
Inventors
Wang, Xiaoyu; Dou, Wenwen; Ribarsky, William
Primary Examiner(s)
VO, HUYEN X

Application Number

US13/832,339
Time in Patent Office

914 Days
Field of Search

704/1-10, 704/251, 704/255, 704/257, 704/270, 704/270.1, 707/708, 707/737, 707/738
US Class Current

1/1
CPC Class Codes

G06F 40/30 Semantic analysis
