Natural language processing of formatted documents

  • US 10,628,525 B2
  • Filed: 05/17/2017
  • Issued: 04/21/2020
  • Est. Priority Date: 05/17/2017
  • Status: Active Grant
First Claim

1. A computer-implemented method for processing text, the method comprising:

    determining, by a computer, that a span of natural language text is associated with one or more formatting characteristics;

    applying, by the computer, optical character recognition (OCR) to the span of natural language text associated with the one or more formatting characteristics, wherein identifying, by the computer, that the span of the natural language text is bold type by comparing the pixel thickness of the characters of the span of the natural language text to an average pixel thickness of the natural language text;

    identifying, by the computer, that the span of the natural language text is italics type by analyzing the angle of the pixels of the characters in the span of the natural language text;

    identifying, by the computer, that the span of the natural language text is underlined by analyzing the number of pixels in a consistent horizontal line underneath the characters in the span of the natural language text;

    identifying, by the computer, that the span of the natural language text is a subscript by recognizing that a numerical character is located slightly below the span of the natural language text;

    calculating, by the computer, integer offsets denoting the span of natural language text modified by the identified formatting characteristics;

    applying, by the computer, numerical integer offsets denoting a beginning and an end of the identified formatting characteristics;

    denoting, by the computer, a page number as well as a beginning character and an end character of the span of natural language text that is modified;

    converting, by the computer, the span of natural language text, denoted by the numerical integer offsets at the beginning and the end of the identified formatting characteristics, into a common analysis structure (CAS);

    generating, by the computer, a data structure for storage in memory comprising at least one of the one or more formatting characteristics, and a corresponding span of the natural language text, wherein the data structure comprises the CAS;

    appending, by the computer, the CAS to a CAS file;

    transmitting, by the computer, the generated CAS comprising the at least one of the one or more formatting characteristics and the corresponding span of the natural language text to a natural language processing (NLP) pipeline to identify an intent of the corresponding span of the natural language text;

    performing, by the computer, the one or more actions associated with the one or more formatting characteristics;

    associating, by the computer, a level of importance with the corresponding span of the natural language text based on the at least one of the one or more formatting characteristics, wherein the level of importance for the at least one of the one or more formatting characteristics is pre-configured;

    ranking, by the computer, the one or more actions associated with the identified one or more formatting characteristics, wherein bold type is ranked as more important than underline text; and

    incorporating, by the computer, the generated CAS data structure into a machine learning model that learns from and makes predictions on natural language text data.
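
The offset, bold-detection, and ranking steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation and does not build a real CAS. The names `FormattedSpan`, `annotate`, `is_bold`, `rank_by_importance`, and `IMPORTANCE`, the 1.3 thickness ratio, and the importance values for italics and subscript are all assumptions made for illustration. The only ordering taken from the claim is that bold ranks above underline.

```python
from dataclasses import dataclass

# Hypothetical pre-configured importance levels. Only "bold above underline"
# comes from the claim; the other values are illustrative assumptions.
IMPORTANCE = {"bold": 3, "italic": 2, "underline": 1, "subscript": 0}


@dataclass(frozen=True)
class FormattedSpan:
    """A span of text plus the formatting that modifies it (illustrative)."""
    page: int        # page number on which the span occurs
    begin: int       # integer offset of the first character of the span
    end: int         # integer offset one past the last character of the span
    formatting: str  # e.g. "bold", "italic", "underline", "subscript"
    text: str        # the covered text, kept for convenience


def annotate(page_text, page, begin, end, formatting):
    """Record a formatted span by its page number and integer offsets."""
    return FormattedSpan(page, begin, end, formatting, page_text[begin:end])


def is_bold(span_thickness, page_thicknesses, ratio=1.3):
    """Compare a span's stroke thickness (in pixels) to the page average.

    The ratio threshold is an assumed tuning parameter, not from the claim.
    """
    average = sum(page_thicknesses) / len(page_thicknesses)
    return span_thickness > ratio * average


def rank_by_importance(spans):
    """Order spans from most to least important formatting."""
    return sorted(spans, key=lambda s: IMPORTANCE[s.formatting], reverse=True)


if __name__ == "__main__":
    text = "Take two tablets daily with food."
    spans = [
        annotate(text, 1, 0, 4, "underline"),
        annotate(text, 1, 5, 16, "bold"),
    ]
    for span in rank_by_importance(spans):
        print(span.page, span.begin, span.end, span.formatting, repr(span.text))
```

In this sketch the offsets follow Python slicing (the end offset is exclusive). A real pipeline would have to settle on one convention and apply it consistently wherever the spans are stored and read back.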