Apparatus for automatic theme detection from unstructured data
First Claim
1. A system comprising:
a repository of unstructured documents stored in a computing system;
a natural language processor configured to perform language processing; and
a non-transitory computer-readable storage medium comprising instructions that, when executed, enable a computing system to detect themes within the unstructured documents by;
removing noise words from the unstructured documents, to yield clean documents;
initiate a sentiment computation component configured to determine sentiment of each word in the clean documents by;
assigning to each word in the clean documents at least one of a positive sentiment, a negative sentiment, and neutral sentiment, to yield assigned sentiments;
determining a sentiment probability of a section of the unstructured data based on the assigned sentiments of words in the section, to yield assigned sectional sentiment; and
determining an overall sentiment probability distribution for the unstructured documents based on the assigned sectional sentiment of multiple sections of the unstructured documents;
initiate a theme detection component configured to;
discover themes based on topics with neutral sentiment when the topics are located in a section of the unstructured documents with a sentiment probability that is greater than an overall sentiment probability distribution;
assign labels to each discovered theme;
identify patterns that describe each theme;
identify instances of the themes within individual documents of the unstructured documents based on a presence of the patterns in the individual documents; and
organize the themes in a hierarchy using the instances of the themes; and
initiate a user interface configured to;
allow an operator to initiate theme detection by the theme detection component; and
allow an operator to view and interact with results of the theme detection, wherein the results comprise at least one of the assigned labels, the patterns, and the hierarchy.
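As a rough illustration of the sentiment steps recited in claim 1, the following Python sketch removes noise words, assigns a sentiment to each word, computes a per-section distribution, averages those into an overall distribution, and keeps neutral words found in sections that are more sentiment-bearing than the collection as a whole. The word lists, the sentence-level sections, and the reading of "sentiment probability" as the share of positive plus negative words are all assumptions made for illustration; the claim does not specify any of them.

```python
import re
from collections import Counter

# Hypothetical word lists; a real system would rely on full lexicon dictionaries.
NOISE_WORDS = {"the", "a", "an", "is", "was", "and", "but", "of", "to", "at", "it"}
POSITIVE = {"great", "friendly", "love"}
NEGATIVE = {"slow", "rude", "broken"}
LABELS = ("positive", "negative", "neutral")


def clean(section):
    """Lowercase, tokenize, and drop noise words."""
    tokens = re.findall(r"[a-z']+", section.lower())
    return [t for t in tokens if t not in NOISE_WORDS]


def word_sentiment(word):
    """Assign exactly one of the three sentiments to a word."""
    if word in POSITIVE:
        return "positive"
    if word in NEGATIVE:
        return "negative"
    return "neutral"


def section_distribution(words):
    """Fraction of words in a section carrying each sentiment."""
    counts = Counter(word_sentiment(w) for w in words)
    total = len(words)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


def overall_distribution(distributions):
    """Average the per-section distributions across the collection."""
    n = len(distributions)
    return {label: (sum(d[label] for d in distributions) / n if n else 0.0)
            for label in LABELS}


def polarity(dist):
    """Assumed 'sentiment probability': share of non-neutral words."""
    return dist["positive"] + dist["negative"]


def candidate_topics(sections):
    """Count neutral words that occur in sections above the overall polarity."""
    cleaned = [clean(s) for s in sections]
    dists = [section_distribution(ws) for ws in cleaned]
    threshold = polarity(overall_distribution(dists))
    topics = Counter()
    for words, dist in zip(cleaned, dists):
        if polarity(dist) > threshold:
            topics.update(w for w in words if word_sentiment(w) == "neutral")
    return topics


sections = [
    "The checkout was slow and broken",
    "Delivery arrived at noon",
    "Friendly staff, great checkout",
]
print(candidate_topics(sections))  # Counter({'checkout': 2, 'staff': 1})
```

In this toy input the purely neutral second section falls below the overall threshold, so its words are not proposed as themes, while "checkout" is picked up from both opinionated sections.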
Abstract
This apparatus provides a system and method for determining significant repeating themes in a collection of documents. The apparatus operates unsupervised and leverages a natural language processing mechanism, supported by lexicon, synonym, and taxonomy dictionaries, to determine themes and establish their relevance using a two-level hierarchical structure. The apparatus also assigns meaningful names to the identified themes and determines a set of rules describing each theme, so that the rules can be applied to categorize documents outside the collection as well.
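To make the last point of the abstract concrete, here is a minimal Python sketch of applying theme rules to a document that was not part of the original collection. The theme names, the rule syntax (regular expressions), and the `categorize` helper are hypothetical; the abstract states only that rules exist and that themes sit in a two-level hierarchy.

```python
import re

# Hypothetical two-level category model: top-level theme -> sub-theme -> rule.
# Rules are regular expressions here; the actual rule format is not specified.
CATEGORY_MODEL = {
    "Checkout": {
        "Payment": re.compile(r"\b(credit card|payment|paypal)\b", re.IGNORECASE),
        "Queue": re.compile(r"\b(queue|line|wait(ing)?)\b", re.IGNORECASE),
    },
    "Delivery": {
        "Timing": re.compile(r"\b(late|delay(ed)?|on time)\b", re.IGNORECASE),
    },
}


def categorize(document):
    """Return every (top-level theme, sub-theme) whose rule matches the document."""
    matches = []
    for top, sub_themes in CATEGORY_MODEL.items():
        for sub, rule in sub_themes.items():
            if rule.search(document):
                matches.append((top, sub))
    return matches


print(categorize("My package was delayed and the payment failed"))
# [('Checkout', 'Payment'), ('Delivery', 'Timing')]
```

Because the rules are stored separately from the documents that produced them, the same model can be reused on new text without rerunning theme discovery.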
204 Citations
41 Claims
1. A system comprising the elements recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method of determining themes from a collection of unstructured text documents, the method comprising:
receiving a set of unstructured text documents to process;
removing noise words from the unstructured text documents, to yield clean documents;
initiate a language processing component configured to process unstructured data collected from the unstructured text documents;
determining, by a computer system configured to perform natural language processing, sentiment associated with each word in the clean documents, by;
assigning to each word in the clean documents at least one of a positive sentiment, a negative sentiment, and a neutral sentiment, to yield assigned sentiments;
determining a sentiment probability of a section of the unstructured data based on the assigned sentiments of words in the section; and
determining an overall sentiment probability distribution for the unstructured documents based on the assigned sectional sentiment of multiple sections of the unstructured text documents;
determining, by a computing system, a first set of topics within the unstructured text documents of the set and determining a second set of topics based on frequently occurring terms within the set of unstructured text documents;
determining, by the computing system, a label for each term in the frequently occurring terms;
determining, by the computing system, one or more text patterns, wherein the one or more text patterns are used to identify if a term is contained within a document; and
creating, by the computing system, a category model to organize the identified terms as themes in a hierarchical structure that includes top level themes, wherein a topic having neutral sentiment is identified as a theme when the topic is located in a section of the unstructured text documents with a sentiment probability that is greater than an overall sentiment probability distribution.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
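The later steps of claim 11 (frequently occurring terms, labels, text patterns, and a hierarchical category model) can be sketched roughly as follows. Everything specific in this Python example is an assumption: terms are ranked by how many documents contain them, labels are simply the capitalized term, patterns are word-boundary regular expressions, and the hierarchy uses a subsumption heuristic (a term becomes a child of a more widespread term whenever every document containing it also contains that term). The claim itself does not prescribe any of these choices.

```python
import re
from collections import defaultdict


def frequent_terms(clean_docs, min_docs=2):
    """Map each term found in at least `min_docs` documents to those document indices."""
    postings = defaultdict(set)
    for index, tokens in enumerate(clean_docs):
        for term in set(tokens):
            postings[term].add(index)
    return {t: docs for t, docs in postings.items() if len(docs) >= min_docs}


def build_category_model(clean_docs, min_docs=2):
    """Build a two-level model: top-level label -> pattern plus child labels."""
    terms = frequent_terms(clean_docs, min_docs)
    # Most widespread terms first; ties are broken alphabetically for determinism.
    ordered = sorted(terms, key=lambda t: (-len(terms[t]), t))
    model = {}
    top_level = []
    for term in ordered:
        label = term.title()  # assumed labeling rule
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        # Subsumption: pick the first top-level term covering all of this term's documents.
        parent = next((p for p in top_level if terms[term] <= terms[p]), None)
        if parent is None:
            top_level.append(term)
            model[label] = {"pattern": pattern, "children": {}}
        else:
            model[parent.title()]["children"][label] = pattern
    return model


# Hypothetical pre-cleaned documents (noise words already removed).
docs = [
    ["checkout", "payment", "failed"],
    ["checkout", "payment"],
    ["checkout", "queue"],
    ["delivery", "late"],
    ["delivery", "late", "checkout"],
]
model = build_category_model(docs)
for top, entry in model.items():
    print(top, "->", list(entry["children"]))
```

The stored patterns can then be run against any document to decide whether a given theme is present in it.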
26. A non-transitory computer-readable storage medium comprising instructions that when executed enable a natural language processing computing system to:
receiving a set of unstructured text documents to process;
removing noise words from the unstructured text documents, to yield clean documents;
initiate a language processing component configured to process unstructured data collected from the unstructured text documents;
determining sentiment associated with each word in the clean documents, by;
assigning to each word in the clean documents at least one of a positive sentiment, a negative sentiment, and a neutral sentiment, to yield assigned sentiments;
determining a sentiment probability of a section of the unstructured data based on the assigned sentiments of words in the section; and
determining an overall sentiment probability distribution for the unstructured documents based on the assigned sectional sentiment of multiple sections of the unstructured text documents;
determining a first set of topics within the unstructured text documents of the set and determining a second set of topics based on frequently occurring terms within the set of unstructured text documents;
determining a label for each term in the frequently occurring terms;
determining one or more text patterns, wherein the one or more text patterns are used to identify if a term is contained within a document; and
creating a category model to organize the identified terms as themes in a hierarchical structure that includes top level themes, wherein a topic having neutral sentiment is identified as a theme when the topic is located in a section of the unstructured text documents with a sentiment probability that is greater than an overall sentiment probability distribution.
Dependent claims: 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41
Specification