Text extraction module for contextual analysis engine
First Claim
1. A method for extracting text from digital content, the method comprising:
receiving a webpage that includes at least one content item and an executable item that, when executed, generates dynamic content;
receiving a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
injecting the programming library into a security sandbox;
invoking a headless browser in the security sandbox;
using the headless browser to process the webpage and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without displaying the webpage;
filtering the executable item from the webpage and the generated dynamic content to produce a filtered corpus of plain text that does not include the executable item, but that does include the at least one content item and the generated dynamic content; and
generating a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of plain text.
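The filtering and schema-generation steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the headless browser has already run and returned the rendered markup as a string (no scripts are executed here), and the names `PlainTextFilter`, `filter_plain_text`, `build_output_schema`, and the node keys `extraction_rules` and `extracted_text` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class PlainTextFilter(HTMLParser):
    """Collects text content while skipping executable and style elements."""

    # Assumption: these tags stand in for the claim's "executable item".
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def filter_plain_text(rendered_html):
    """Return the plain text of already-rendered markup, minus scripts."""
    parser = PlainTextFilter()
    parser.feed(rendered_html)
    parser.close()
    return " ".join(parser.chunks)


def build_output_schema(rule_names, plain_text):
    """Build a two-sub-node hierarchy: executed rules and filtered text."""
    return {
        "document": {
            "extraction_rules": {"executed": list(rule_names)},
            "extracted_text": {"corpus": plain_text},
        }
    }
```

Given rendered markup in which a script has already inserted a paragraph, `filter_plain_text` keeps the static heading and the generated paragraph but drops the script body, and `build_output_schema` pairs that text with the list of rule names supplied by the caller.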
Abstract
A contextual analysis engine systematically extracts, analyzes and organizes digital content stored in an electronic file such as a webpage. Content can be extracted using a text extraction module which is capable of separating the content which is to be analyzed from less meaningful content such as format specifications and programming scripts. The resulting unstructured corpus of plain text can then be passed to a text analytics module capable of generating a structured categorization of topics included within the content. This structured categorization can be organized based on a content topic ontology which may have been previously defined or which may be developed in real-time. The systems disclosed herein optionally include an input/output interface capable of managing workflows of the text extraction module and the text analytics module, administering a cache of previously generated results, and interfacing with other applications that leverage the disclosed contextual analysis services.
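The workflow and caching role of the input/output interface described in the abstract might look roughly like the following Python sketch. The class name `AnalysisInterface`, the choice of a SHA-256 content hash as the cache key, and the two injected callables are all assumptions made for illustration; the abstract does not specify how the cache is keyed or how the modules are wired together.

```python
import hashlib


class AnalysisInterface:
    """Hypothetical front end that chains extraction and analytics with a cache."""

    def __init__(self, extract, analyze):
        self._extract = extract    # stand-in for the text extraction module
        self._analyze = analyze    # stand-in for the text analytics module
        self._cache = {}           # previously generated results, keyed by content hash

    def process(self, raw_content):
        # Assumption: identical raw content always yields identical results.
        key = hashlib.sha256(raw_content.encode("utf-8")).hexdigest()
        if key not in self._cache:
            plain_text = self._extract(raw_content)
            self._cache[key] = self._analyze(plain_text)
        return self._cache[key]
```

A second request for the same content is served from the cache, so neither module runs again.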
21 Claims
11. A system for text extraction, the system comprising:
an input interface configured to receive (a) digital content that includes a digital content item and an executable item that, when executed, generates dynamic content, and (b) a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
a security sandbox having provided therein the programming library;
a headless browser, running in the security sandbox, that is configured to process the digital content and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without causing the digital content to be displayed;
a text extraction module configured to filter the executable item from the digital content item and the generated dynamic content to produce a filtered corpus of unformatted text that does not include the executable item, but that does include the digital content item and the generated dynamic content; and
a reporting and visualization module configured to generate a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of unformatted text.
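One way to picture "modular extraction rules" that lean on a shared programming library, and whose identities are later reported, is the Python sketch below. Everything named here is hypothetical: `HelperLibrary` stands in for the injected library, `ExtractionRule` and `run_rules` are illustrative, and the rules operate on a plain string rather than on a live document inside a sandboxed browser as the claim requires.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


class HelperLibrary:
    """Hypothetical stand-in for the programming library the rules rely on."""

    @staticmethod
    def collapse_whitespace(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()

    @staticmethod
    def drop_lines_matching(text: str, pattern: str) -> str:
        kept = [line for line in text.splitlines() if not re.search(pattern, line)]
        return "\n".join(kept)


@dataclass(frozen=True)
class ExtractionRule:
    """A named, independently pluggable transformation."""
    name: str
    apply: Callable[[str, HelperLibrary], str]


def run_rules(text: str, rules: List[ExtractionRule],
              lib: HelperLibrary) -> Tuple[str, List[str]]:
    """Apply each rule in order and record which rules were executed."""
    executed = []
    for rule in rules:
        text = rule.apply(text, lib)
        executed.append(rule.name)
    return text, executed
```

The returned list of rule names is the kind of information the first sub-node of the output schema is said to identify; keeping each rule a small named unit is what makes that identification straightforward.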
17. A non-transient computer readable medium having instructions encoded thereon that, when executed by at least one processor, causes a text extraction process for digital content to be carried out, the process comprising:
receiving digital content that includes at least one content item and an executable item that, when executed, generates dynamic content, wherein the digital content relates to a plurality of topics;
receiving a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
injecting the programming library into a security sandbox;
invoking a document object model (DOM) processing library in the security sandbox to process the digital content and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without displaying the digital content;
filtering the executable item from the digital content and the generated dynamic content to produce a filtered corpus of plain text that does not include the executable item, but that does include the at least one content item and the generated dynamic content; and
generating a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of plain text.
Specification