
Text extraction module for contextual analysis engine

  • US 10,235,681 B2
  • Filed: 10/15/2013
  • Issued: 03/19/2019
  • Est. Priority Date: 10/15/2013
  • Status: Active Grant
First Claim

1. A method for extracting text from digital content, the method comprising:

    receiving a webpage that includes at least one content item and an executable item that, when executed, generates dynamic content;

    receiving a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;

    injecting the programming library into a security sandbox;

    invoking a headless browser in the security sandbox;

    using the headless browser to process the webpage and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without displaying the webpage;

    filtering the executable item from the webpage and the generated dynamic content to produce a filtered corpus of plain text that does not include the executable item, but that does include the at least one content item and the generated dynamic content; and

    generating a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of plain text.
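
The last two steps of the claim (filtering and schema generation) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the headless browser has already processed the page in the sandbox and that its rendered DOM is available as an HTML string, so the dynamic content is already present in the markup. The names `ExecutableFilter`, `build_output`, `extraction_rules`, `extracted_text`, and the rule identifiers are all hypothetical, and treating `<style>` as filterable alongside `<script>` is an added assumption.

```python
import json
from html.parser import HTMLParser


class ExecutableFilter(HTMLParser):
    """Collects text nodes while skipping the contents of filtered elements."""

    # Assumption: <script> stands in for the claim's "executable item";
    # <style> is also dropped so that CSS does not leak into the plain text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a filtered element
        self._chunks = []      # text fragments kept for the corpus

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def plain_text(self):
        return "\n".join(self._chunks)


def build_output(rendered_html, executed_rule_ids):
    """Return a nested structure with one sub-node per claim element."""
    parser = ExecutableFilter()
    parser.feed(rendered_html)
    parser.close()
    return {
        "document": {
            # First sub-node: which extraction rules were executed (hypothetical IDs).
            "extraction_rules": {"executed": list(executed_rule_ids)},
            # Second sub-node: the filtered corpus of plain text.
            "extracted_text": {"corpus": parser.plain_text()},
        }
    }


# Hypothetical rendered DOM: the <div> stands in for content the script generated.
rendered = (
    "<html><body><h1>Headline</h1>"
    "<script>document.write('x')</script>"
    "<div id='dyn'>Generated at load time</div>"
    "</body></html>"
)
result = build_output(rendered, ["strip-nav", "main-article"])
print(json.dumps(result, indent=2))
```

In this sketch the script body is excluded from the corpus, while the static heading and the generated text are both kept, mirroring the claim's requirement that the filtered text omit the executable item but include the content item and the dynamic content.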

  • 3 Assignments