Text extraction module for contextual analysis engine

US 10,235,681 B2
Filed: 10/15/2013
Issued: 03/19/2019
Est. Priority Date: 10/15/2013
Status: Active Grant

Abstract

A contextual analysis engine systematically extracts, analyzes and organizes digital content stored in an electronic file such as a webpage. Content can be extracted using a text extraction module which is capable of separating the content which is to be analyzed from less meaningful content such as format specifications and programming scripts. The resulting unstructured corpus of plain text can then be passed to a text analytics module capable of generating a structured categorization of topics included within the content. This structured categorization can be organized based on a content topic ontology which may have been previously defined or which may be developed in real-time. The systems disclosed herein optionally include an input/output interface capable of managing workflows of the text extraction module and the text analytics module, administering a cache of previously generated results, and interfacing with other applications that leverage the disclosed contextual analysis services.
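
The separation step described in the abstract can be illustrated with a minimal sketch. It assumes that "less meaningful content" means the bodies of `<script>` and `<style>` elements, and it uses Python's standard `html.parser` module rather than the headless browser recited in the claims, so no dynamic content is generated here. The names `PlainTextExtractor` and `extract_plain_text` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of script and style tags."""

    # Assumption: these two tags stand in for "format specifications and
    # programming scripts"; the patent does not enumerate tag names.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_plain_text(html: str) -> str:
    """Return an unstructured plain-text corpus for the given HTML string."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_plain_text` on a page whose head holds a stylesheet and a script returns only the heading and paragraph text, which is the kind of corpus that would then be handed to a text analytics step.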

21 Claims

1. A method for extracting text from digital content, the method comprising:
- receiving a webpage that includes at least one content item and an executable item that, when executed, generates dynamic content;
  
  receiving a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
  
  injecting the programming library into a security sandbox;
  
  invoking a headless browser in the security sandbox;
  
  using the headless browser to process the webpage and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without displaying the webpage;
  
  filtering the executable item from the webpage and the generated dynamic content to produce a filtered corpus of plain text that does not include the executable item, but that does include the at least one content item and the generated dynamic content; and
  
  generating a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of plain text.
  - 2. The method of claim 1, wherein the webpage includes a plurality of content items encoded using a hypertext markup language (HTML) and a plurality of executable items encoded using JavaScript commands.
  - 3. The method of claim 1, wherein:
    - using the headless browser to process the webpage further comprises executing a JavaScript element that forms part of the executable item; and
      
      executing the JavaScript element causes the dynamic content to be generated.
  - 4. The method of claim 1, wherein:
    - the headless browser is provided with a jQuery framework; and
      
      the modular extraction rules invoke JavaScript commands that are implemented using the jQuery framework of the headless browser.
  - 5. The method of claim 1, wherein the at least one content item comprises a webpage header and webpage metadata.
  - 6. The method of claim 1, further comprising receiving, with the webpage, an operating parameter that defines at least one of the modular extraction rules.
  - 7. The method of claim 1, further comprising deriving a grammatical root word for a term included within the corpus of plain text.
  - 8. The method of claim 1, further comprising:
    - identifying a compound word in the corpus of plain text and breaking the identified compound word into constituent parts; and
      
      adding the constituent parts to the corpus of plain text.
  - 9. The method of claim 1, wherein the headless browser executes the modular extraction rules using a WebKit plugin.
  - 10. The method of claim 1, wherein the headless browser is provided by a plugin that is scriptable with a JavaScript application programming interface (API).
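
Claims 7 and 8 recite deriving a grammatical root word and splitting compound words. A toy sketch of both steps follows. It assumes a three-entry English suffix list and a five-word lexicon, both invented for illustration; the patent does not disclose a specific stemming or decompounding algorithm, and the function names `stem`, `decompound`, and `enrich` are hypothetical.

```python
# Assumption: a tiny English suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")

# Assumption: a toy lexicon of known constituent words.
VOCABULARY = {"web", "page", "text", "book", "note"}


def stem(term: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    lowered = term.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


def decompound(term: str) -> list[str]:
    """Split a term into two lexicon words, or return [] if no split exists."""
    lowered = term.lower()
    for i in range(1, len(lowered)):
        head, tail = lowered[:i], lowered[i:]
        if head in VOCABULARY and tail in VOCABULARY:
            return [head, tail]
    return []


def enrich(corpus: list[str]) -> list[str]:
    """Return the corpus with the constituents of each compound appended."""
    enriched = list(corpus)
    for term in corpus:
        enriched.extend(decompound(term))
    return enriched
```

Appending the constituents instead of replacing the compound mirrors the wording of claim 8, where the parts are added to the corpus and the original term stays in place.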

11. A system for text extraction, the system comprising:
- an input interface configured to receive (a) digital content that includes a digital content item and an executable item that, when executed, generates dynamic content, and (b) a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
  
  a security sandbox having provided therein the programming library;
  
  a headless browser, running in the security sandbox, that is configured to process the digital content and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without causing the digital content to be displayed;
  
  a text extraction module configured to filter the executable item from the digital content item and the generated dynamic content to produce a filtered corpus of unformatted text that does not include the executable item, but that does include the digital content item and the generated dynamic content; and
  
  a reporting and visualization module configured to generate a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of unformatted text.
  - 12. The system of claim 11, further comprising a localization stemmer configured to derive a grammatical root word for a term included within the corpus of unformatted text.
  - 13. The system of claim 11, further comprising a localization decompounder configured to (a) identify a compound word in the corpus of unformatted text, (b) break the identified compound word into constituent parts, and (c) add the constituent parts to the corpus of unformatted text.
  - 14. The system of claim 11, further comprising:
    - a localization stemmer configured to derive a grammatical root word for a term included within the corpus of unformatted text;
      
      a localization decompounder configured to (a) identify a compound word in the corpus of unformatted text, (b) break the identified compound word into constituent parts, and (c) add the constituent parts to the corpus of unformatted text; and
      
      a natural language module including language-specific grammatical rules used by the localization stemmer and the localization decompounder.
  - 15. The system of claim 11, further comprising a conversion tool configured to convert the digital content to a hypertext markup language (HTML) document before the headless browser processes the digital content.
  - 16. The system of claim 11, wherein the digital content is a webpage, and:
    - a first one of the modular extraction rules is configured to extract metadata from the webpage; and
      
      a second one of the modular extraction rules is configured to extract text and hyperlinks from the webpage.
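
The rule arrangement in claim 16 and the two-node output schema in claim 11 can be sketched as follows. This is only an illustration under stated assumptions: the rules run against Python's standard `html.parser` instead of a headless browser in a security sandbox, and every name here (`PageScanner`, `RULES`, `run_rules`, and the keys `analysis`, `rules_executed`, `rule_ids`, `extracted_content`) is hypothetical and not taken from the patent.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Gathers meta tags, hyperlinks, and visible text in one pass."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []
        self.text = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag in ("script", "style"):
            self._skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skipping = False

    def handle_data(self, data):
        if not self._skipping and data.strip():
            self.text.append(data.strip())


def metadata_rule(page):
    """First rule: extract metadata from the page."""
    return {"metadata": page.meta}


def text_and_links_rule(page):
    """Second rule: extract text and hyperlinks from the page."""
    return {"text": " ".join(page.text), "hyperlinks": page.links}


# Assumption: rules are registered by name so the output can identify them.
RULES = {"metadata": metadata_rule, "text_and_links": text_and_links_rule}


def run_rules(html, rule_names):
    """Apply the named rules and wrap the results in a two-node hierarchy."""
    page = PageScanner()
    page.feed(html)
    page.close()
    extracted = {}
    for name in rule_names:
        extracted.update(RULES[name](page))
    return {
        "analysis": {
            "rules_executed": {"rule_ids": list(rule_names)},  # first sub-node
            "extracted_content": extracted,                     # second sub-node
        }
    }
```

Keeping each rule as a separate named function is one way to make the rules modular: a caller can select a subset by name, and the first sub-node records exactly which ones ran.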

17. A non-transient computer readable medium having instructions encoded thereon that, when executed by at least one processor, causes a text extraction process for digital content to be carried out, the process comprising:
- receiving digital content that includes at least one content item and an executable item that, when executed, generates dynamic content, wherein the digital content relates to a plurality of topics;
  
  receiving a plurality of modular extraction rules, execution of which leverages functionality provided by a programming library;
  
  injecting the programming library into a security sandbox;
  
  invoking a document object model (DOM) processing library in the security sandbox to process the digital content and execute the modular extraction rules in the security sandbox, thereby causing the dynamic content to be generated without displaying the digital content;
  
  filtering the executable item from the digital content and the generated dynamic content to produce a filtered corpus of plain text that does not include the executable item, but that does include the at least one content item and the generated dynamic content; and
  
  generating a hierarchical output schema that includes a first analyzer/enhancer sub-node that includes an identification of the modular extraction rules that were executed in the security sandbox, and a second analyzer/enhancer sub-node that includes the filtered corpus of plain text.
  - 18. The non-transient computer readable medium of claim 17, wherein the process further comprises sending the corpus of plain text to a text analytics module configured to generate a list of topics that is derived from the corpus of plain text, and that includes the plurality of topics.
  - 19. The non-transient computer readable medium of claim 17, wherein the process further comprises sending the corpus of plain text to a text analytics module configured to generate a list of topic keywords derived from the corpus of plain text.
  - 20. The non-transient computer readable medium of claim 17, wherein:
    - using the DOM processing library to process the digital content further comprises executing a JavaScript element that forms part of the executable item; and
      
      executing the JavaScript element causes the dynamic content to be generated.
  - 21. The non-transient computer readable medium of claim 17, wherein the DOM processing library is incorporated into a headless browser.
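
Claim 19 refers to a text analytics module that generates a list of topic keywords from the plain-text corpus. The claims do not say how that list is derived, so the sketch below substitutes simple term-frequency counting as a stand-in. The stopword set and the function name `topic_keywords` are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a minimal English stopword set; a real module would use more.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def topic_keywords(corpus: str, limit: int = 3) -> list[str]:
    """Return the most frequent non-stopword terms in the corpus."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:limit]]
```

Frequency counting ignores the content topic ontology mentioned in the abstract, so this only shows where a keyword list would come from, not how the patented module ranks topics.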

Current Assignee: Adobe Inc.
Original Assignee: Adobe Inc.
Inventors: Chang, Walter; Chen, Chris; Sadler, Shone; Jared, David
Primary Examiner(s): Coupe, Anita
Assistant Examiner(s): Prasad, Nancy N
Application Number: US14/054,318
Publication Number: US 20150106157A1
Time in Patent Office: 1,981 Days
CPC Class Codes:
- G06F 40/237 Lexical tools
- G06Q 30/0201 Market modelling; Market an...
