Contextual analysis engine

US 9,990,422 B2
Filed: 10/15/2013
Issued: 06/05/2018
Est. Priority Date: 10/15/2013
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A contextual analysis engine systematically extracts, analyzes and organizes digital content stored in an electronic file such as a webpage. Content can be extracted using a text extraction module which is capable of separating the content which is to be analyzed from less meaningful content such as format specifications and programming scripts. The resulting unstructured corpus of plain text can then be passed to a text analytics module capable of generating a structured categorization of topics included within the content. This structured categorization can be organized based on a content topic ontology which may have been previously defined or which may be developed in real-time. The systems disclosed herein optionally include an input/output interface capable of managing workflows of the text extraction module and the text analytics module, administering a cache of previously generated results, and interfacing with other applications that leverage the disclosed contextual analysis services.
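
The text extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the input is static HTML and uses Python's standard `html.parser`; the names `PlainTextExtractor` and `extract_plain_text` are hypothetical and do not come from the patent. Unlike a headless browser, this sketch does not execute scripts, so it only separates visible text from script and style content.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_plain_text(html):
    """Return an unstructured plain-text corpus for the given HTML string."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hiking boots</h1><p>Waterproof leather boots.</p></body></html>"
)
corpus = extract_plain_text(page)
```

The resulting `corpus` string is the kind of unstructured plain text that would then be passed to a text analytics step.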

37 Citations

16 Claims

1. A method of analyzing digital content, the method comprising:
- receiving a corpus of text;
  
  extracting a plurality of n-grams from the corpus of text;
  
  constructing a multi-dimensional document feature vector, wherein the multi-dimensional document feature vector includes at least a portion of the n-grams extracted from the corpus of text and a relevance factor corresponding to each of the n-grams included in the multi-dimensional document feature vector;
  
  extracting a portion of topics included in a topic ontology, wherein each of the extracted topics is related to at least one of the n-grams included in the multi-dimensional document feature vector;
  
  generating a hierarchical listing that includes the extracted topics, wherein the hierarchical listing comprises a first plurality of nodes in a first branch of the hierarchical listing, and a second plurality of nodes in a second branch of the hierarchical listing, and wherein a particular node in a particular branch of the hierarchical listing includes a particular extracted topic; and
  
  assigning a relevancy score to the particular extracted topic, wherein the assigned relevancy score is based on (a) the relevance factor corresponding to an n-gram that is related to the particular extracted topic, and (b) relevancy scores assigned to other extracted topics included in the particular branch of the hierarchical listing,
  
  wherein the hierarchical listing has a hierarchical structure corresponding to a hierarchical structure of the topic ontology, such that topics extracted from a relatively higher ontology level are in a corresponding higher hierarchical level of the listing, and topics extracted from a relatively lower ontology level are in a corresponding lower hierarchical level of the hierarchical listing, and
  
  wherein the hierarchical listing includes an extracted topic that is not included in the plurality of n-grams extracted from the corpus of text.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein each of the relevance factors included in the multi-dimensional document feature vector is based on a frequency of the corresponding n-gram.
  - 3. The method of claim 1, wherein the corpus of text is unstructured.
  - 4. The method of claim 1, wherein the corpus of text is received from a text extraction module that includes a headless browser.
  - 5. The method of claim 1, further comprising ranking n-grams included in the multi-dimensional document feature vector according to their respective relevance factors.
  - 6. The method of claim 1, wherein the topics extracted from the topic ontology are selected for extraction by identifying n-grams within the topic ontology and topics from the topic ontology that have a parent relationship with at least one of the identified n-grams.
  - 7. The method of claim 1, wherein the topics extracted from the topic ontology are selected for extraction by identifying n-grams within the topic ontology and topics from the topic ontology that have a direct parent relationship with at least one of the identified n-grams.
  - 8. The method of claim 1, wherein the topics extracted from the topic ontology are selected for extraction by identifying n-grams within the topic ontology and topics from the topic ontology that have an indirect parent relationship with at least one of the identified n-grams.
  - 9. The method of claim 1, wherein the particular node of the hierarchical listing further includes a frequency count for the particular extracted topic, and the frequency count corresponds to a number of occurrences, in the corpus of text, of the n-gram that is related to the particular extracted topic.
  - 10. The method of claim 1, wherein the multi-dimensional document feature vector includes only n-grams that were extracted from the corpus of plain text.
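
The steps of claim 1 can be sketched end to end as follows. This is an illustration under stated assumptions, not the patented implementation: the toy `ONTOLOGY`, the function names, and the scoring rule (a topic's score is its own relevance factor plus the scores of its child topics in the same branch) are all hypothetical choices. The relevance factor here is a plain relative frequency, in the spirit of claim 2.

```python
import re
from collections import Counter

# Hypothetical toy ontology: each topic maps to its parent (None marks a root).
ONTOLOGY = {
    "apparel": None,
    "footwear": "apparel",
    "hiking boots": "footwear",
    "outdoor equipment": None,
    "camping gear": "outdoor equipment",
    "tent": "camping gear",
}


def extract_ngrams(text, max_n=2):
    """Return all word n-grams of length 1..max_n from the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_feature_vector(ngrams):
    """Map each n-gram to a relevance factor (assumed: relative frequency)."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}


def extract_topics(vector, ontology):
    """Collect ontology topics matching an n-gram, plus all their ancestors."""
    topics = set()
    for ngram in vector:
        node = ngram if ngram in ontology else None
        while node is not None:
            topics.add(node)
            node = ontology[node]
    return topics


def score_topics(vector, topics, ontology):
    """Build the hierarchy and score each topic from its branch (assumed rule)."""
    children = {topic: [] for topic in topics}
    for topic in topics:
        parent = ontology[topic]
        if parent is not None:
            children[parent].append(topic)  # ancestors are always in topics
    scores = {}

    def score(topic):
        own = vector.get(topic, 0.0)  # 0.0 for topics absent from the text
        scores[topic] = own + sum(score(child) for child in children[topic])
        return scores[topic]

    for root in (t for t in topics if ontology[t] is None):
        score(root)
    return children, scores


def render(children, scores, ontology):
    """Render the hierarchical listing as indented lines, one per node."""
    lines = []

    def walk(topic, depth):
        lines.append(f"{'  ' * depth}{topic} ({scores[topic]:.3f})")
        for child in sorted(children[topic]):
            walk(child, depth + 1)

    for root in sorted(t for t in children if ontology[t] is None):
        walk(root, 0)
    return lines


vector = build_feature_vector(extract_ngrams("Hiking boots and a tent. The tent is light."))
topics = extract_topics(vector, ONTOLOGY)
children, scores = score_topics(vector, topics, ONTOLOGY)
listing = render(children, scores, ONTOLOGY)
```

In this example the listing has two branches, and parent topics such as "footwear" and "camping gear" appear in it even though neither phrase occurs in the input text, which mirrors the final wherein clause of claim 1.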

11. A system for analyzing digital content, the system comprising:
- an n-gram extractor configured to extract a plurality of n-grams from an unstructured corpus of text;
  
  a topic model generator configured to construct a multi-dimensional document feature vector that includes at least a portion of the n-grams extracted from the unstructured corpus of text and a relevance factor corresponding to each of the n-grams included in the multi-dimensional document feature vector;
  
  a topic categorizer configured to extract a portion of topics included in a topic ontology, wherein each of the extracted topics is related to one of the n-grams included in the multi-dimensional document feature vector; and
  
  a text analytics module configured to generate a hierarchical listing that includes the extracted topics, wherein the hierarchical listing comprises a first plurality of nodes in a first branch of the hierarchical listing, and a second plurality of nodes in a second branch of the hierarchical listing, and wherein a particular node in a particular branch of the hierarchical listing includes a particular extracted topic, and assign a relevancy score to the particular extracted topic, wherein the assigned relevancy score is based on (a) the relevance factor corresponding to an n-gram that is related to the particular extracted topic, and (b) relevancy scores assigned to other extracted topics included in the particular branch of the hierarchical listing;
  
  wherein at least one of the extracted topics is not included within the plurality of n-grams extracted from the unstructured corpus of text;
  
  wherein the hierarchical listing has a hierarchical structure corresponding to a hierarchical structure of the topic ontology, such that topics extracted from a relatively higher ontology level are in a corresponding higher hierarchical level of the listing, and topics extracted from a relatively lower ontology level are in a corresponding lower hierarchical level of the hierarchical listing; and
  
  wherein the hierarchical listing includes an extracted topic that is not included in the plurality of n-grams extracted from the corpus of text.
- Dependent Claims (12, 13, 14, 15, 16)
  - 12. The system of claim 11, wherein the n-gram extractor is further configured to generate frequency data corresponding to the extracted plurality of n-grams, the frequency data comprising a frequency count.
  - 13. The system of claim 11, wherein the n-gram extractor is further configured to filter one or more stop words from the extracted plurality of n-grams.
  - 14. The system of claim 11, wherein the topics extracted from the topic ontology are selected for extraction by identifying n-grams within the topic ontology and topics from the topic ontology that have a parent relationship with at least one of the identified n-grams.
  - 15. The system of claim 11, further comprising a named entity extractor and a natural language parser, wherein:
    - the natural language parser is configured to identify one or more tagged noun expressions in the unstructured corpus of text;
      
      the named entity extractor is configured to extract additional topics from the unstructured corpus of text based on the one or more tagged noun expressions identified by the natural language parser; and
      
      the multi-dimensional document feature vector also includes at least a portion of the extracted additional topics.
  - 16. The system of claim 11, wherein the relevance factor is at least partially based on a spatial distribution of a given extracted n-gram within the unstructured corpus of text.
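
Claims 12, 13, and 16 mention frequency counts, stop-word filtering, and a relevance factor that depends partly on spatial distribution. The sketch below shows one possible way to combine them for single-word n-grams. The claims do not specify a formula, so the weighting `frequency * (1 + spread)`, the `STOP_WORDS` set, and the function names are assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Illustrative stop-word subset (an assumption, not a list from the patent).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def unigram_positions(text):
    """Return token positions for each non-stop word, plus the token count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        if token not in STOP_WORDS:
            positions[token].append(index)
    return positions, len(tokens)


def relevance_factors(text):
    """Combine frequency count and spatial spread into a relevance factor."""
    positions, n_tokens = unigram_positions(text)
    factors = {}
    for ngram, where in positions.items():
        frequency = len(where)  # frequency count, as in claim 12
        # Spread: the fraction of the document spanned between the first
        # and last occurrence (0.0 for a term that occurs only once).
        spread = (where[-1] - where[0]) / (n_tokens - 1) if n_tokens > 1 else 0.0
        factors[ngram] = frequency * (1.0 + spread)
    return factors


factors = relevance_factors("The tent is light and the tent is cheap")
```

Under this assumed weighting, a term that recurs across the document ("tent") receives a higher factor than a term that appears once ("light"), and stop words such as "the" are excluded entirely.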

Current Assignee: Adobe Inc.
Original Assignee: Adobe Systems Incorporated (Adobe Inc.)
Inventors: Chang, Walter
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Tessema, Aida
Application Number: US14/054,351
Publication Number: US 20150106078A1
Time in Patent Office: 1,694 Days
Field of Search: None
CPC Class Codes: G06F 16/35 Clustering; Classification
