
Electronic document source ingestion for natural language processing systems

  • US 9,053,086 B2
  • Filed: 12/12/2012
  • Issued: 06/09/2015
  • Est. Priority Date: 12/10/2012
  • Status: Expired due to Fees
First Claim

1. A method, comprising:

    receiving a plurality of electronic documents, wherein each electronic document is arranged according to a different, respective format comprising a plurality of headers;

    identifying a properties file associated with one of the electronic documents, the properties file defining a particular header of the respective format in the one electronic document, an action corresponding to a text portion associated with the particular header, and an extension class;

    instantiating a preprocessor for parsing the one electronic document based on the extension class, wherein the preprocessor is configured to parse only electronic documents arranged using the respective format;

    parsing the one electronic document to identify the particular header using one or more processors and the preprocessor;

    upon identifying the text portion associated with the particular header, performing the action to the text portion by assigning the text portion to a formatting element of a normalized format; and

    storing the text portion into a natural language processing (NLP) object based on the formatting element of the normalized format, wherein text in the NLP object is arranged based on the normalized format.
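
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The properties syntax (`extension.class`, `header.<NAME>`), the `MemoPreprocessor` class, the `NLPObject` container, and the "HEADER: text" memo format are all hypothetical names and conventions chosen for illustration; the claim itself does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class NLPObject:
    """Holds text keyed by formatting element of the normalized format (hypothetical)."""
    elements: dict = field(default_factory=dict)

    def store(self, element, text):
        self.elements.setdefault(element, []).append(text)


def load_properties(text):
    """Parse 'key = value' lines into an extension class name and header actions."""
    extension_class, actions = None, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "extension.class":
            extension_class = value
        elif key.startswith("header."):
            # Map a source header to a formatting element of the normalized format.
            actions[key[len("header."):]] = value
    return extension_class, actions


class MemoPreprocessor:
    """Parses only the assumed 'HEADER: text' memo format."""

    def __init__(self, actions):
        self.actions = actions

    def parse(self, document):
        nlp_object = NLPObject()
        for line in document.splitlines():
            header, sep, text = line.partition(":")
            if not sep:
                continue  # not a header line in this format
            element = self.actions.get(header.strip())
            if element is not None:
                # Perform the action: assign the text portion to the element.
                nlp_object.store(element, text.strip())
        return nlp_object


# Registry standing in for loading a class by name (assumption).
PREPROCESSORS = {"MemoPreprocessor": MemoPreprocessor}


def ingest(document, properties_text):
    """Instantiate the preprocessor named in the properties and parse the document."""
    extension_class, actions = load_properties(properties_text)
    preprocessor = PREPROCESSORS[extension_class](actions)
    return preprocessor.parse(document)


SAMPLE_PROPERTIES = """
extension.class = MemoPreprocessor
header.SUBJECT = title
header.BODY = paragraph
"""

SAMPLE_DOCUMENT = """SUBJECT: Quarterly results
BODY: Revenue rose in the third quarter.
FOOTER: internal use only"""
```

Calling `ingest(SAMPLE_DOCUMENT, SAMPLE_PROPERTIES)` returns an `NLPObject` whose `elements` hold the subject text under `title` and the body text under `paragraph`. The `FOOTER` line is skipped because the sample properties define no action for that header.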
