Automated systems and methods for textual extraction of relevant data elements from an electronic clinical document
Abstract
A system and method for extracting relevant data elements from a file for conversion to a tabular format includes a computing device receiving an XML format file having a loop with nested blocks. Each of the blocks has at least one data element. Features are extracted from each data element. These extracted features are processed using a machine learning algorithm to estimate a column header value for the data elements relative to a data schema. With the data element classified, a configuration file is generated to map the column header value to the data elements of the XML file. The configuration file is used to extract the data elements from the XML file to a tabular format. In the healthcare industry, the system and method may be used to extract relevant health information from a clinical document for conversion to a tabular format.
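The final extraction step described in the abstract, using a configuration file to pull data elements out of a looped XML structure into a table, might look roughly like the following Python sketch. The sample XML, the `row_path`/`columns` layout of the configuration, and the function names are all assumptions made for illustration; the text here does not specify a configuration format or an implementation language.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: <entries> plays the role of the loop and each
# <entry> is one nested block holding several data elements.
SAMPLE_XML = """\
<document>
  <entries>
    <entry>
      <name>Hemoglobin</name>
      <result><value>13.5</value><unit>g/dL</unit></result>
    </entry>
    <entry>
      <name>Glucose</name>
      <result><value>98</value><unit>mg/dL</unit></result>
    </entry>
  </entries>
</document>
"""

# Hypothetical configuration: each column header maps to a path that is
# evaluated relative to one block of the loop.
CONFIG = {
    "row_path": "./entries/entry",
    "columns": {
        "test_name": "./name",
        "result_value": "./result/value",
        "result_unit": "./result/unit",
    },
}


def xml_to_rows(xml_text, config):
    """Return one dict per nested block, keyed by column header."""
    root = ET.fromstring(xml_text)
    rows = []
    for block in root.findall(config["row_path"]):
        row = {}
        for column, path in config["columns"].items():
            node = block.find(path)
            # Missing or empty elements become empty cells.
            row[column] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


def rows_to_csv(rows, columns):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML, CONFIG)
csv_text = rows_to_csv(rows, list(CONFIG["columns"]))
```

Each block of the loop becomes one table row, so the configuration alone decides which elements are kept and what their columns are called.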
20 Claims
1. A computer-implemented method for extracting relevant data elements from an electronic file for conversion to tabular format, the method comprising:
receiving, in a computing device, an Extensible Markup Language (XML) format file, the XML file having at least one loop with nested blocks, wherein each of the nested blocks has at least one data element, the at least one data element having an unstructured or semi-structured format;
extracting features from the data elements;
processing, with a processor of the computing device, the extracted features using a machine learning algorithm to estimate a column header value for the data elements relative to a data schema;
classifying, by the processor, the data elements from the XML file using the extracted features;
generating, by the processor, a configuration file which maps the column header value to the data elements of the XML file;
parsing the XML file using the configuration file to extract unstructured or semi-structured alphanumeric data values of the data elements from the XML file and convert the data elements to a structured tabular format; and
ingesting the structured tabular format of the data elements into a data analytics processing system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
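The feature-extraction and column-header estimation steps of claim 1 can be illustrated with a small sketch. The claim names neither the features nor the machine learning algorithm, so everything below is an assumption: the features (tag name, a coarse value "shape", and word tokens), the toy training examples, the column header labels, and the choice of a hand-written multinomial naive Bayes classifier as a stand-in for whatever model is actually used.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(tag, text):
    """Turn one data element (tag name plus text value) into string features."""
    features = ["tag=" + tag.lower()]
    text = text.strip()
    # Coarse description of what the value looks like.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        features.append("shape=date")
    elif re.fullmatch(r"-?\d+(\.\d+)?", text):
        features.append("shape=number")
    else:
        features.append("shape=text")
    # Alphabetic word tokens from the value itself.
    features += ["tok=" + token for token in re.findall(r"[a-z]+", text.lower())]
    return features


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, examples):
        for features, label in examples:
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            denominator = sum(self.feature_counts[label].values()) + vocab_size
            for feature in features:
                score += math.log((self.feature_counts[label][feature] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled elements: (tag, text) -> column header in the schema.
TRAINING = [
    (("birthTime", "1980-04-12"), "date_of_birth"),
    (("dob", "1975-11-30"), "date_of_birth"),
    (("value", "13.5"), "result_value"),
    (("value", "98"), "result_value"),
    (("name", "Hemoglobin"), "test_name"),
    (("displayName", "Glucose"), "test_name"),
]

model = NaiveBayes().fit(
    [(extract_features(tag, text), label) for (tag, text), label in TRAINING]
)
predicted = model.predict(extract_features("birthDate", "1990-01-05"))
```

In this toy setup an element with the unseen tag `birthDate` is still assigned to the date column because its value has a date shape, which is the kind of generalization the feature step is meant to enable.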
9. A computer-implemented system for the extraction of relevant data elements from an electronic file for conversion to a tabular format, the system comprising:
a computing device receiving an Extensible Markup Language (XML) format file, the XML file having at least one loop with nested blocks, wherein each of the nested blocks has at least one data element, the at least one data element having an unstructured or semi-structured format;
a processor of the computing device executing instructions for:
extracting features from the data elements of an XML file;
processing the extracted features using a machine learning algorithm to estimate a column header value for the data elements relative to a data schema;
segregating the data elements from the XML file using the extracted features; and
generating a configuration file which maps the column header value to the data elements of the XML file;
parsing the XML file using the configuration file to extract unstructured or semi-structured alphanumeric data values of the data elements from the XML file and convert the data elements to a structured tabular format; and
ingesting the structured tabular format of the data elements into a data analytics processing system.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer-implemented system for extracting relevant health information from a clinical document in an Extensible Markup Language (XML) format for conversion to a tabular format, the system comprising:
a first computing device receiving the clinical document, wherein the clinical document has a plurality of loops with a plurality of attributes describing data elements, the plurality of data elements having an unstructured or semi-structured format, wherein the plurality of data elements correspond to health information within the clinical document;
a processor of the first computing device executing instructions for:
extracting features from at least one of the plurality of attributes or the data elements of a clinical document using textual analysis;
processing the extracted features using a machine learning algorithm to estimate a column header value for the data elements relative to a predefined data schema;
segregating the data elements from the clinical document using the extracted features; and
generating a configuration file which maps the column header value to the data elements of the clinical document using a key-value pair, where a key of the key-value pair provides a column header value name from a data-lake schema and a value from the key-value pair provides an XPath of the clinical document;
at least one second computing device in communication with the first computing device, wherein, at the at least one second computing device, the configuration file is used to parse the data elements from the clinical document to a tabular format by extracting unstructured or semi-structured alphanumeric data values of the data elements from the clinical document and converting the data elements to a structured tabular format; and
a data analytics processing system ingesting the structured tabular format of the data elements.
Dependent claims: 18, 19, 20
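Claim 17 describes the configuration file as key-value pairs, with a column header name as the key and an XPath into the clinical document as the value. One way to picture this is the Python sketch below. The sample document, the header names, the `PREDICTED_HEADERS` lookup (a placeholder for the output of the machine learning step), and the JSON serialization are all assumptions; the claim does not state a file format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical clinical document; <observations> is a loop of blocks.
CLINICAL_XML = """\
<ClinicalDocument>
  <patient>
    <name>Jane Doe</name>
    <birthTime>1980-04-12</birthTime>
  </patient>
  <observations>
    <observation><code>HGB</code><value>13.5</value></observation>
    <observation><code>GLU</code><value>98</value></observation>
  </observations>
</ClinicalDocument>
"""

# Placeholder for the classifier output: XPath -> estimated column header.
PREDICTED_HEADERS = {
    "/ClinicalDocument/patient/name": "patient_name",
    "/ClinicalDocument/patient/birthTime": "date_of_birth",
    "/ClinicalDocument/observations/observation/code": "observation_code",
    "/ClinicalDocument/observations/observation/value": "observation_value",
}


def leaf_xpaths(element, prefix=""):
    """List the distinct absolute paths of all leaf elements, in document order."""
    path = prefix + "/" + element.tag
    children = list(element)
    if not children:
        return [path]
    paths = []
    for child in children:
        for leaf_path in leaf_xpaths(child, path):
            if leaf_path not in paths:  # repeated loop blocks share one path
                paths.append(leaf_path)
    return paths


def find_values(root, xpath):
    """Evaluate an absolute path against the root and return the text values."""
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    # ElementTree supports only a subset of XPath and rejects absolute
    # paths on an element, so rewrite the path relative to the root.
    relative = "/".join(["."] + steps[1:])
    return [(node.text or "").strip() for node in root.findall(relative)]


root = ET.fromstring(CLINICAL_XML)

# First device: build and serialize the key-value configuration.
config = {
    PREDICTED_HEADERS[path]: path
    for path in leaf_xpaths(root)
    if path in PREDICTED_HEADERS
}
config_json = json.dumps(config, indent=2)

# Second device: load the configuration and extract the column values.
loaded = json.loads(config_json)
columns = {header: find_values(root, xpath) for header, xpath in loaded.items()}
```

Serializing the mapping and reloading it mirrors the split between the first computing device, which generates the configuration, and the second, which only needs the file and the document to do the parsing.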
Specification