Natural language processing of formatted documents

US 10,628,525 B2
Filed: 05/17/2017
Issued: 04/21/2020
Est. Priority Date: 05/17/2017
Status: Active Grant

Abstract

Detecting and incorporating formatting characteristics within natural language processing analytics. Source documents are ingested and the markup formatting language is identified by the program. Once identified, the markup language is parsed and examined for formatting characteristics, embedded notes, comments and other metadata. The formatting characteristics of the plain text are extracted, along with the plain text, and converted into a common analysis structure (CAS), or CAS-equivalent structure, which annotates the natural language text together with its respective formatting characteristics. The CAS or CAS-equivalent structures are stored and sent to a natural language processing pipeline for further analysis via complex algorithms and rules. The natural language processing results data are curated to reflect meaningful analysis of the extracted CAS or CAS-equivalent structure.
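
The flow the abstract describes (parse the markup, extract the plain text, record each formatting characteristic with the span it covers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes HTML as the markup language, and the names `TAG_TO_FORMAT`, `Annotation`, `CasLike`, `FormattingExtractor` and `to_cas` are hypothetical. `CasLike` only stands in for a "CAS-equivalent structure" and is not the Apache UIMA CAS API.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to formatting characteristics.
TAG_TO_FORMAT = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic",
    "u": "underline", "s": "strikethrough",
    "sub": "subscript", "sup": "superscript",
}


@dataclass
class Annotation:
    begin: int        # integer offset of the first formatted character
    end: int          # integer offset one past the last formatted character
    formatting: str   # formatting characteristic covering [begin, end)


@dataclass
class CasLike:
    text: str = ""                                   # plain text without markup
    annotations: list = field(default_factory=list)  # formatting annotations


class FormattingExtractor(HTMLParser):
    """Collects plain text and the offsets of formatted spans."""

    def __init__(self):
        super().__init__()
        self.parts = []        # plain-text fragments seen so far
        self.length = 0        # number of plain-text characters seen so far
        self.open_tags = []    # stack of (tag, begin offset)
        self.annotations = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_FORMAT:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened tag with the same name.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, begin = self.open_tags.pop(i)
                self.annotations.append(
                    Annotation(begin, self.length, TAG_TO_FORMAT[tag]))
                break

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


def to_cas(markup: str) -> CasLike:
    parser = FormattingExtractor()
    parser.feed(markup)
    parser.close()
    ordered = sorted(parser.annotations, key=lambda a: (a.begin, a.end))
    return CasLike("".join(parser.parts), ordered)


cas = to_cas("Take <b>two</b> tablets of H<sub>2</sub>O")
print(cas.text)
for a in cas.annotations:
    print(a.formatting, a.begin, a.end, repr(cas.text[a.begin:a.end]))
```

Offsets are counted over the extracted plain text rather than the raw markup, so a downstream pipeline can slice `cas.text[a.begin:a.end]` directly.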

11 Claims

1. A computer-implemented method for processing text, the method comprising:
- determining, by a computer, that a span of natural language text is associated with one or more formatting characteristics;
  
  applying, by the computer, optical character recognition (OCR) to the span of natural language text associated with the one or more formatting characteristics, wherein identifying, by the computer, that the span of the natural language text is bold type by comparing the pixel thickness of the characters of the span of the natural language text to an average pixel thickness of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is italics type by analyzing the angle of the pixels of the characters in the span of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is underlined by analyzing the number of pixels in a consistent horizontal line underneath the characters in the span of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is a subscript by recognizing that a numerical character is located slightly below the span of the natural language text;
  
  calculating, by the computer, integer offsets denoting the span of natural language text modified by the identified formatting characteristics;
  
  applying, by the computer, numerical integer offsets denoting a beginning and an end of the identified formatting characteristics;
  
  denoting, by the computer, a page number as well as a beginning character and an end character of the span of natural language text that is modified;
  
  converting, by the computer, the span of natural language text, denoted by the numerical integer offsets at the beginning and the end of the identified formatting characteristics, into a common analysis structure (CAS);
  
  generating, by the computer, a data structure for storage in memory comprising at least one of the one or more formatting characteristics, and a corresponding span of the natural language text, wherein the data structure comprises the CAS;
  
  appending, by the computer, the CAS to a CAS file;
  
  transmitting, by the computer, the generated CAS comprising the at least one of the one or more formatting characteristics and the corresponding span of the natural language text to a natural language processing (NLP) pipeline to identify an intent of the corresponding span of the natural language text;
  
  performing, by the computer, the one or more actions associated with the one or more formatting characteristics;
  
  associating, by the computer, a level of importance with the corresponding span of the natural language text based on the at least one of the one or more formatting characteristics, wherein the level of importance for the at least one of the one or more formatting characteristics is pre-configured;
  
  ranking, by the computer, the one or more actions associated with the identified one or more formatting characteristics, wherein bold type is ranked as more important than underline text; and
  
  incorporating, by the computer, the generated CAS data structure into a machine learning model that learns from and makes predictions on natural language text data.
  - 2. The method of claim 1, wherein the NLP pipeline comprises a Question and Answer (QA) pipeline, and wherein the input text comprises a question, the QA pipeline analyzing the question based on the at least one of the one or more formatting characteristics and the corresponding span of the natural language text in the question.
  - 3. The method of claim 1, wherein the NLP pipeline comprises a relationship extraction pipeline, the relationship extraction pipeline incorporating the generated data structure into its detection and classification of semantic relationships within a natural language text analysis.
  - 4. The method of claim 1, wherein the NLP pipeline comprises a syntax tree parsing pipeline, the syntax tree parsing pipeline incorporating the generated data structure into its construction of parse trees for sentences in natural language text.
  - 5. The method of claim 1, wherein the NLP pipeline comprises a text mining pipeline, the text mining pipeline incorporating the generated data structure into its text mining analytics.
  - 6. The method of claim 1, wherein determining, by a computer, that a natural language text is associated with one or more formatting characteristics comprises:
    - determining, by the computer, that the natural language text is structured; and
      
      identifying the one or more formatting characteristics by identifying formatting meta tags associated with the natural language text.
  - 7. The method of claim 1, further comprising:
    - associating, by the computer, the one or more formatting characteristics with one or more actions, the one or more actions comprising any one of:
      
      categorizing as irrelevant the span of natural language text associated with a strikethrough formatting characteristic;
      
      emphasizing the span of natural language text associated with an underline formatting characteristic;
      
      emphasizing the span of natural language text associated with a bold type formatting characteristic;
      
      emphasizing the span of natural language text associated with an italics formatting characteristic;
      
      categorizing as a chemical formula the span of natural language text associated with a subscript formatting characteristic; and
      
      categorizing as a mathematical formula the span of natural language text associated with a superscript formatting characteristic.
  - 8. The method of claim 7, further comprising:
    - ranking, by the computer, the action associated with the identified at least one formatting characteristic.
  - 11. The method of claim 1, further comprising:
    - ingesting, by the computer, one or more source documents, wherein the one or more source documents comprise voice to text metadata, and wherein the voice to text metadata includes a tone of voice and a volume indicator.
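
The association of formatting characteristics with actions (claim 7) and the ranking of those actions by a pre-configured level of importance (claims 1 and 8) can be illustrated with a small lookup-and-sort sketch. The action names, the numeric levels in `IMPORTANCE`, and the function `rank_actions` are assumptions made for illustration. The only ordering the claims actually state is that bold type ranks above underlined text.

```python
# Actions per formatting characteristic, following the list in claim 7.
# The action identifiers themselves are hypothetical.
ACTIONS = {
    "strikethrough": "categorize_irrelevant",
    "underline": "emphasize",
    "bold": "emphasize",
    "italic": "emphasize",
    "subscript": "categorize_chemical_formula",
    "superscript": "categorize_mathematical_formula",
}

# Hypothetical pre-configured importance levels. Only "bold above
# underline" comes from the claims; every other value is an assumption.
IMPORTANCE = {"bold": 3, "underline": 2, "italic": 1}


def rank_actions(spans):
    """Attach an action and importance to each (begin, end, formatting)
    span and return the results sorted from most to least important."""
    ranked = []
    for begin, end, formatting in spans:
        action = ACTIONS.get(formatting)
        if action is None:
            continue  # no action configured for this characteristic
        ranked.append({
            "begin": begin,
            "end": end,
            "formatting": formatting,
            "action": action,
            "importance": IMPORTANCE.get(formatting, 0),
        })
    # sorted() is stable, so spans of equal importance keep document order.
    return sorted(ranked, key=lambda r: r["importance"], reverse=True)


spans = [(0, 4, "underline"), (10, 15, "bold"), (20, 22, "strikethrough")]
for item in rank_actions(spans):
    print(item["formatting"], item["action"], item["importance"])
```

Keeping the importance levels in a separate table mirrors the claim language that the level of importance is pre-configured, since the table can be changed without touching the ranking logic.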

9. A computer program product for processing text, comprising a non-transitory tangible storage device having program code embodied therewith, the program code executable by a processor of a computer to perform a method, the method comprising:
- determining, by a processor, that a span of natural language text is associated with one or more formatting characteristics;
  
  applying, by the processor, optical character recognition (OCR) to the span of natural language text associated with the one or more formatting characteristics, wherein identifying, by the processor, that the span of the natural language text is bold type by comparing the pixel thickness of the characters of the span of the natural language text to an average pixel thickness of the natural language text;
  
  identifying, by the processor, that the span of the natural language text is italics type by analyzing the angle of the pixels of the characters in the span of the natural language text;
  
  identifying, by the processor, that the span of the natural language text is underlined by analyzing the number of pixels in a consistent horizontal line underneath the characters in the span of the natural language text;
  
  identifying, by the processor, that the span of the natural language text is a subscript by recognizing that a numerical character is located slightly below the span of the natural language text;
  
  calculating, by the processor, integer offsets denoting the span of natural language text modified by the identified formatting characteristics;
  
  applying, by the processor, numerical integer offsets denoting a beginning and an end of the identified formatting characteristics;
  
  denoting, by the processor, a page number as well as a beginning character and an end character of the span of natural language text that is modified;
  
  converting, by the processor, the span of natural language text, denoted by the numerical integer offsets at the beginning and the end of the identified formatting characteristics, into a common analysis structure (CAS);
  
  generating, by the processor, a data structure for storage in memory comprising at least one of the one or more formatting characteristics, and a corresponding span of the natural language text, wherein the data structure comprises the CAS;
  
  appending, by the processor, the CAS to a CAS file;
  
  transmitting, by the processor, the generated CAS comprising the at least one of the one or more formatting characteristics and the corresponding span of the natural language text to a natural language processing (NLP) pipeline to identify an intent of the corresponding span of the natural language text;
  
  performing, by the processor, the one or more actions associated with the one or more formatting characteristics;
  
  associating, by the processor, a level of importance with the corresponding span of the natural language text based on the at least one of the one or more formatting characteristics, wherein the level of importance for the at least one of the one or more formatting characteristics is pre-configured;
  
  ranking, by the processor, the one or more actions associated with the identified one or more formatting characteristics, wherein bold type is ranked as more important than underline text; and
  
  incorporating, by the processor, the generated CAS data structure into a machine learning model that learns from and makes predictions on natural language text data.

10. A computer system, comprising:
- one or more computer devices each having one or more processors and one or more tangible storage devices; and
  
  a program embodied on at least one of the one or more storage devices, the program having a plurality of program instructions for execution by the one or more processors, the program instructions comprising instructions for:
  
  determining, by a computer, that a span of natural language text is associated with one or more formatting characteristics;
  
  applying, by the computer, optical character recognition (OCR) to the span of natural language text associated with the one or more formatting characteristics, wherein identifying, by the computer, that the span of the natural language text is bold type by comparing the pixel thickness of the characters of the span of the natural language text to an average pixel thickness of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is italics type by analyzing the angle of the pixels of the characters in the span of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is underlined by analyzing the number of pixels in a consistent horizontal line underneath the characters in the span of the natural language text;
  
  identifying, by the computer, that the span of the natural language text is a subscript by recognizing that a numerical character is located slightly below the span of the natural language text;
  
  calculating, by the computer, integer offsets denoting the span of natural language text modified by the identified formatting characteristics;
  
  applying, by the computer, numerical integer offsets denoting a beginning and an end of the identified formatting characteristics;
  
  denoting, by the computer, a page number as well as a beginning character and an end character of the span of natural language text that is modified;
  
  converting, by the computer, the span of natural language text, denoted by the numerical integer offsets at the beginning and the end of the identified formatting characteristics, into a common analysis structure (CAS);
  
  generating, by the computer, a data structure for storage in memory comprising at least one of the one or more formatting characteristics, and a corresponding span of the natural language text, wherein the data structure comprises the CAS;
  
  appending, by the computer, the CAS to a CAS file;
  
  transmitting, by the computer, the generated CAS comprising the at least one of the one or more formatting characteristics and the corresponding span of the natural language text to a natural language processing (NLP) pipeline to identify an intent of the corresponding span of the natural language text;
  
  performing, by the computer, the one or more actions associated with the one or more formatting characteristics;
  
  associating, by the computer, a level of importance with the corresponding span of the natural language text based on the at least one of the one or more formatting characteristics, wherein the level of importance for the at least one of the one or more formatting characteristics is pre-configured;
  
  ranking, by the computer, the one or more actions associated with the identified one or more formatting characteristics, wherein bold type is ranked as more important than underline text; and
  
  incorporating, by the computer, the generated CAS data structure into a machine learning model that learns from and makes predictions on natural language text data.
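
The pixel-based identification steps recited in the independent claims (bold type by comparing pixel thickness to an average, underline by a consistent horizontal line beneath the characters) might look roughly like the sketch below. It is a toy heuristic, not the patented method. It assumes each span has already been cropped to a binary bitmap given as rows of strings, with `#` marking ink. The functions `mean_stroke_width`, `find_bold_spans` and `has_underline`, and the thresholds `ratio` and `min_fill`, are all hypothetical.

```python
def mean_stroke_width(bitmap):
    """Average length of the horizontal runs of ink ('#') in a bitmap."""
    runs = []
    for row in bitmap:
        run = 0
        for px in row:
            if px == "#":
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0


def find_bold_spans(span_bitmaps, ratio=1.5):
    """Indices of spans whose stroke width exceeds `ratio` times the
    average stroke width over all spans (assumed threshold)."""
    widths = [mean_stroke_width(b) for b in span_bitmaps]
    inked = [w for w in widths if w > 0]
    if not inked:
        return []
    average = sum(inked) / len(inked)
    return [i for i, w in enumerate(widths) if w > ratio * average]


def has_underline(bitmap, baseline_row, min_fill=0.9):
    """True if some row below the text baseline is almost entirely ink,
    i.e. a consistent horizontal line under the characters."""
    for row in bitmap[baseline_row + 1:]:
        if row and row.count("#") / len(row) >= min_fill:
            return True
    return False


regular = ["#.#", "#.#"]
bold = ["###.###", "###.###"]
print(find_bold_spans([regular, regular, bold]))
print(has_underline(["#.#.#", ".....", "#####"], baseline_row=0))
```

A production OCR system would measure stroke thickness on real glyph images and would need to handle noise and varying font sizes. The sketch only shows the comparison against an average that the claims describe.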

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Fink, Patrick W.; McNeil, Kristin E.; Parker, Philip E.; Werts, David B.
Primary Examiner(s): Mishra, Richa
Application Number: US15/597,212
Publication Number: US 20180336181A1
Time in Patent Office: 1,070 Days
Field of Search: None

CPC Class Codes

G06F 40/109   Font handling; Temporal or ...
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/30   Semantic analysis
G06N 20/00   Machine learning
G06V 30/10   Character recognition
G06V 30/40   Document-oriented image-bas...
