Hierarchical information extraction using document segmentation and optical character recognition correction
Abstract
Systems, methods, and media for extracting and processing entity data included in an electronic document are provided herein. Methods may include executing one or more extractors to extract entity data within an electronic document based upon an extraction model for the document, selecting extracted entity data via one or more experts, each of the experts applying at least one business rule to organize at least a portion of the selected entity data into a desired format, and providing the organized entity data for use by an end user.
15 Claims
1. A method for providing extracted entity data from electronic documents, the method comprising:
receiving entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document;
selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data;
preventing extraction of entity data from a section of the electronic document having distorted content by:
  generating a first-order hidden Markov model for each section of the electronic document, based upon a layout of the document;
  applying the first-order hidden Markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section;
  aligning the section with characters extracted from the section of the electronic document; and
  configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment;
assembling the selected entity data into desired formats;
filling a portion of the set of slots with the portions of the selected entity data; and
outputting a marked phrase from the organized entity data.
Dependent claims: 2, 3, 4, 5, 6.
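The distorted-content steps recited above (apply a first-order hidden Markov model, determine the most likely hidden states, then ignore the distorted portion) can be illustrated with Viterbi decoding. The following is only a minimal sketch under stated assumptions: the claim does not specify the hidden states, the observations, or any probabilities, so the two states (`clean`, `distorted`), the per-line noise buckets, and every number in `START`, `TRANS`, and `EMIT` are hypothetical choices made for illustration. The alignment step is approximated here by keeping one decoded state per input line.

```python
import math

# Hypothetical hidden states; the claim does not name them.
STATES = ("clean", "distorted")

# Assumed, illustrative probabilities (not taken from the patent).
START = {"clean": 0.8, "distorted": 0.2}
TRANS = {
    "clean": {"clean": 0.9, "distorted": 0.1},
    "distorted": {"clean": 0.2, "distorted": 0.8},
}
EMIT = {
    "clean": {"low": 0.7, "mid": 0.25, "high": 0.05},
    "distorted": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

# Punctuation treated as ordinary in business documents (an assumption).
ORDINARY = set(".,:;-$%()/'\"")


def noise_bucket(line):
    """Map one line of OCR text to a coarse observation symbol."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return "low"
    noisy = sum(1 for c in chars if not c.isalnum() and c not in ORDINARY)
    ratio = noisy / len(chars)
    if ratio < 0.05:
        return "low"
    if ratio < 0.25:
        return "mid"
    return "high"


def viterbi(observations):
    """Return the most likely hidden-state sequence for the observations."""
    if not observations:
        return []
    first = observations[0]
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor under the first-order (one-step) assumption.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            row[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def lines_to_ignore(lines):
    """Indices of lines decoded as distorted, for extractors to skip."""
    path = viterbi([noise_bucket(line) for line in lines])
    return [i for i, state in enumerate(path) if state == "distorted"]
```

In this sketch an extractor or expert would consult `lines_to_ignore` before processing a section and skip the returned indices. Working in log space avoids numeric underflow on long sections.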
7. A system for providing extracted entity data from electronic documents, the system comprising:
two or more experts that each:
  receives entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document;
  selects extracted entity data, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data;
  assembles the selected entity data into desired formats; and
  fills a portion of the set of slots with the portions of the selected entity data;
a disambiguation module that prevents extraction of entity data from a section of the electronic document having distorted content by:
  generating a first-order hidden Markov model for each section of the electronic document, based upon a layout of the document;
  applying the first-order hidden Markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section;
  aligning the section with characters extracted from the section of the electronic document; and
  configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment; and
an output generator that outputs a marked phrase from the organized entity data.
Dependent claims: 8, 9, 10, 11, 12, 13.
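The slot-based business rule recited above (a set of slots whose properties define the conditions for filling them with table cell data) can be pictured as a small data structure. This is a hedged sketch, not the patented implementation: the names `Slot`, `BusinessRule`, `accepts`, and `fill`, the first-match filling policy, and the example date and amount patterns are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Slot:
    """A named slot plus the condition a table cell must meet to fill it."""
    name: str
    accepts: Callable[[str], bool]


@dataclass
class BusinessRule:
    """One expert's rule: an ordered set of slots to fill from table cells."""
    slots: List[Slot]

    def fill(self, cells: List[str]) -> Dict[str, Optional[str]]:
        filled: Dict[str, Optional[str]] = {slot.name: None for slot in self.slots}
        for cell in cells:
            for slot in self.slots:
                # Assumed policy: the first qualifying cell wins each slot.
                if filled[slot.name] is None and slot.accepts(cell):
                    filled[slot.name] = cell
                    break
        return filled


# Two hypothetical experts, each with its own (unique) rule.
invoice_rule = BusinessRule(slots=[
    Slot("date", lambda c: re.fullmatch(r"\d{4}-\d{2}-\d{2}", c) is not None),
    Slot("amount", lambda c: re.fullmatch(r"\$?\d+(\.\d{2})?", c) is not None),
])
vendor_rule = BusinessRule(slots=[
    Slot("vendor", lambda c: c.endswith((" Corp", " Inc"))),
])

cells = ["ACME Corp", "2014-03-01", "$120.00", "$5.00"]
invoice = invoice_rule.fill(cells)
vendor = vendor_rule.fill(cells)
```

Slots that no cell satisfies stay `None`, which matches the claim language of filling only "a portion of the set of slots".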
14. A non-transitory computer readable storage media having a program embodied thereon, the program being executable by a processor to perform a method for extracting entity data from electronic documents, the method comprising:
receiving entity data extracted from an electronic document, the electronic document comprising a scanned version of a hardcopy document;
normalizing the extracted entity data by applying a normalization scheme to the extracted entity data, the normalization scheme converting the extracted entity data into a standardized format;
selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data;
preventing extraction of entity data from a section of the electronic document having distorted content by:
  generating a first-order hidden Markov model for each section of the electronic document, based upon a layout of the document;
  applying the first-order hidden Markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section;
  aligning the section with characters extracted from the section of the electronic document; and
  configuring one or more extractors and the two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment;
executing table experts that produce special annotations that identify table cells for the electronic document which include the extracted and normalized entity data;
assembling the selected entity data into desired formats;
filling a portion of the set of slots with the portions of the selected entity data; and
outputting a marked phrase from the organized entity data.
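The normalization step recited above converts extracted entity data into a standardized format, but the claim does not say which formats are involved. As a minimal sketch only, the example below assumes two hypothetical entity types (dates and currency amounts), ISO 8601 as the standardized date format, and a short list of accepted input layouts in `DATE_FORMATS`; none of these choices come from the patent.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed input layouts, tried in order. "03/01/2014" is read month-first,
# which is itself an assumption (US convention).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no assumed layout matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(text: str) -> Optional[Decimal]:
    """Strip a leading dollar sign and thousands separators, then parse."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Returning `None` for unparseable input lets a downstream expert leave the corresponding slot unfilled instead of storing a guessed value.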
15. A method for disambiguation that prevents extraction of entity data from a section of an electronic document having distorted content, the method comprising:
generating a first-order hidden Markov model for each section of an electronic document, based upon a layout of the document;
applying the first-order hidden Markov model to a section of the electronic document that includes distorted text to determine the most likely hidden states for the section;
aligning the section with characters extracted from the section of the electronic document;
configuring one or more extractors and two or more experts to ignore at least a portion of the electronic document determined to include distorted content, based upon the alignment;
receiving entity data extracted from the electronic document, the electronic document comprising a scanned version of a hardcopy document;
selecting extracted entity data via two or more experts, each of the experts applying at least one unique business rule to organize at least a portion of the selected entity data into a desired format, wherein the at least one unique business rule comprises a set of slots that comprise properties that define conditions for filling the set of slots with table cell data that includes the extracted entity data;
assembling the selected entity data into desired formats;
filling a portion of the set of slots with the portions of the selected entity data; and
outputting a marked phrase from the organized entity data.
Specification