Contextual analysis engine
Abstract
A contextual analysis engine systematically extracts, analyzes and organizes digital content stored in an electronic file such as a webpage. Content can be extracted using a text extraction module which is capable of separating the content which is to be analyzed from less meaningful content such as format specifications and programming scripts. The resulting unstructured corpus of plain text can then be passed to a text analytics module capable of generating a structured categorization of topics included within the content. This structured categorization can be organized based on a content topic ontology which may have been previously defined or which may be developed in real-time. The systems disclosed herein optionally include an input/output interface capable of managing workflows of the text extraction module and the text analytics module, administering a cache of previously generated results, and interfacing with other applications that leverage the disclosed contextual analysis services.
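The text extraction step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the electronic file is an HTML page and uses Python's standard `html.parser` module to discard script and style content. The names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of script/style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return an unstructured plain-text corpus from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Cameras</h1><p>Lens guide</p></body></html>"
)
corpus = extract_text(page)
```

The resulting `corpus` string is the kind of plain text that would then be passed to a text analytics stage.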
37 Citations
16 Claims
1. A method of analyzing digital content, the method comprising:
receiving a corpus of text;
extracting a plurality of n-grams from the corpus of text;
constructing a multi-dimensional document feature vector, wherein the multi-dimensional document feature vector includes at least a portion of the n-grams extracted from the corpus of text and a relevance factor corresponding to each of the n-grams included in the multi-dimensional document feature vector;
extracting a portion of topics included in a topic ontology, wherein each of the extracted topics is related to at least one of the n-grams included in the multi-dimensional document feature vector;
generating a hierarchical listing that includes the extracted topics, wherein the hierarchical listing comprises a first plurality of nodes in a first branch of the hierarchical listing, and a second plurality of nodes in a second branch of the hierarchical listing, and wherein a particular node in a particular branch of the hierarchical listing includes a particular extracted topic; and
assigning a relevancy score to the particular extracted topic, wherein the assigned relevancy score is based on (a) the relevance factor corresponding to an n-gram that is related to the particular extracted topic, and (b) relevancy scores assigned to other extracted topics included in the particular branch of the hierarchical listing,
wherein the hierarchical listing has a hierarchical structure corresponding to a hierarchical structure of the topic ontology, such that topics extracted from a relatively higher ontology level are in a corresponding higher hierarchical level of the listing, and topics extracted from a relatively lower ontology level are in a corresponding lower hierarchical level of the hierarchical listing, and
wherein the hierarchical listing includes an extracted topic that is not included in the plurality of n-grams extracted from the corpus of text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
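The n-gram extraction and feature-vector steps of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not say how the relevance factor is computed, so relative frequency is used here as a stand-in, and the function names, the `max_n` and `top_k` parameters, and the sample corpus are all hypothetical.

```python
import re
from collections import Counter


def extract_ngrams(text, max_n=2):
    """Return every word n-gram of length 1..max_n, lowercased."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def build_feature_vector(ngrams, top_k=5):
    """Map the top_k most frequent n-grams to a relevance factor.

    Assumption: the relevance factor is the n-gram's share of all
    extracted n-grams. The claim leaves the actual measure open.
    """
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(top_k)}


corpus = "Digital cameras and camera lenses. A camera lens guide."
vector = build_feature_vector(extract_ngrams(corpus))
```

Keeping only the `top_k` entries mirrors the claim language that the vector includes "at least a portion of" the extracted n-grams rather than all of them.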
11. A system for analyzing digital content, the system comprising:
an n-gram extractor configured to extract a plurality of n-grams from an unstructured corpus of text;
a topic model generator configured to construct a multi-dimensional document feature vector that includes at least a portion of the n-grams extracted from the unstructured corpus of text and a relevance factor corresponding to each of the n-grams included in the multi-dimensional document feature vector;
a topic categorizer configured to extract a portion of topics included in a topic ontology, wherein each of the extracted topics is related to one of the n-grams included in the multi-dimensional document feature vector; and
a text analytics module configured to generate a hierarchical listing that includes the extracted topics, wherein the hierarchical listing comprises a first plurality of nodes in a first branch of the hierarchical listing, and a second plurality of nodes in a second branch of the hierarchical listing, and wherein a particular node in a particular branch of the hierarchical listing includes a particular extracted topic, and assign a relevancy score to the particular extracted topic, wherein the assigned relevancy score is based on (a) the relevance factor corresponding to an n-gram that is related to the particular extracted topic, and (b) relevancy scores assigned to other extracted topics included in the particular branch of the hierarchical listing;
wherein at least one of the extracted topics is not included within the plurality of n-grams extracted from the unstructured corpus of text;
wherein the hierarchical listing has a hierarchical structure corresponding to a hierarchical structure of the topic ontology, such that topics extracted from a relatively higher ontology level are in a corresponding higher hierarchical level of the listing, and topics extracted from a relatively lower ontology level are in a corresponding lower hierarchical level of the hierarchical listing; and
wherein the hierarchical listing includes an extracted topic that is not included in the plurality of n-grams extracted from the corpus of text.
Dependent claims: 12, 13, 14, 15, 16
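The roles of the topic categorizer and the text analytics module can be illustrated with a small sketch. Everything concrete in it is an assumption made for illustration: the toy `ONTOLOGY`, the `NGRAM_TO_TOPIC` links, the input vector, and the scoring rule, which adds each topic's own n-gram relevance to the relevance of the topics beneath it in the same branch. The claims do not prescribe this particular formula.

```python
# Hypothetical ontology: topic -> parent topic (None marks a top-level topic).
ONTOLOGY = {
    "Electronics": None,
    "Photography": "Electronics",
    "Cameras": "Photography",
    "Lenses": "Photography",
    "Travel": None,
    "Hotels": "Travel",
}

# Hypothetical links from n-grams to ontology topics.
NGRAM_TO_TOPIC = {"camera": "Cameras", "camera lens": "Lenses", "hotel": "Hotels"}


def score_topics(vector):
    """Score matched topics and their ancestors.

    A topic's score is its own n-gram relevance plus the relevance of
    every matched topic below it in the same branch (assumed rule).
    """
    scores = {}
    for ngram, relevance in vector.items():
        node = NGRAM_TO_TOPIC.get(ngram)
        while node is not None:  # walk up the branch to the root
            scores[node] = scores.get(node, 0.0) + relevance
            node = ONTOLOGY[node]
    return scores


def build_listing(scores):
    """Nest the scored topics so the listing mirrors the ontology levels."""
    children = {}
    for topic in scores:
        children.setdefault(ONTOLOGY[topic], []).append(topic)

    def subtree(topic):
        return {
            "topic": topic,
            "score": scores[topic],
            "children": [subtree(c) for c in sorted(children.get(topic, []))],
        }

    return [subtree(root) for root in sorted(children.get(None, []))]


vector = {"camera": 0.5, "camera lens": 0.25, "hotel": 0.25}
scores = score_topics(vector)
listing = build_listing(scores)
```

In this sketch the listing has two branches, rooted at "Electronics" and "Travel", and it contains topics such as "Photography" that never appear among the n-grams, which corresponds to the final "wherein" clause of the claim.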
Specification