Hierarchical information extraction using document segmentation and optical character recognition correction

  • US 10,755,093 B2
  • Filed: 06/12/2017
  • Issued: 08/25/2020
  • Est. Priority Date: 01/27/2012
  • Status: Active Grant
First Claim

1. A method for providing extracted entity data from electronic documents, the method comprising:

    receiving entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document;

    selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data;

    preventing extraction of entity data from a section of the electronic document having distorted content by:

    generating a first-order hidden Markov model for each section of the electronic document, based upon a layout of the document;

    applying the first-order hidden Markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section;

    aligning the section with characters extracted from the section of the electronic document; and

    configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment;

    assembling the selected entity data into desired formats;

    filling a portion of the set of slots with the portions of the selected entity data; and

    outputting a marked phrase from the organized entity data.
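
The hidden Markov model steps of the claim can be illustrated with a small sketch. The claim does not name a decoding algorithm, a state set, or any probabilities, so everything below is an assumption made for illustration: Viterbi decoding is used as one standard way to find the most likely hidden states of a first-order hidden Markov model, the two states ("clean", "distorted"), the three character classes, and all numeric parameters are hypothetical, and `distorted_spans` is a made-up helper that aligns the decoded states with the extracted characters so that downstream extractors could skip the flagged ranges.

```python
import math

# Hypothetical two-state model; the claim does not specify states or parameters.
STATES = ("clean", "distorted")
START = {"clean": 0.8, "distorted": 0.2}
TRANS = {
    "clean": {"clean": 0.9, "distorted": 0.1},
    "distorted": {"clean": 0.2, "distorted": 0.8},
}
EMIT = {
    "clean": {"alnum": 0.85, "space": 0.10, "junk": 0.05},
    "distorted": {"alnum": 0.30, "space": 0.10, "junk": 0.60},
}


def char_class(ch):
    """Map an OCR character to a coarse observation symbol (assumed classes)."""
    if ch.isalnum():
        return "alnum"
    if ch.isspace():
        return "space"
    return "junk"


def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    if not observations:
        return []
    first = observations[0]
    # Log-probabilities avoid underflow on long sections.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def distorted_spans(text):
    """Align decoded states with characters; return [start, end) ranges to ignore."""
    path = viterbi([char_class(c) for c in text])
    spans, start = [], None
    for i, state in enumerate(path):
        if state == "distorted" and start is None:
            start = i
        elif state != "distorted" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(path)))
    return spans
```

Under these assumed parameters, a call such as `distorted_spans(section_text)` returns character ranges that an extractor could be configured to skip. A real system would presumably estimate the model from the document layout, as the claim recites, rather than hard-code it.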
