Natural language processing-assisted extract, transform, and load techniques
First Claim
1. A computer-implemented method for mapping fields of an input document according to a first format, the method comprising:
identifying, by execution of one or more processors, a plurality of first fields in the input document, wherein each first field includes an input descriptor and text content associated with the input descriptor;
identifying, by execution of the one or more processors, a plurality of mapping rules wherein each mapping rule specifies characteristics associated with a target field in a target format, wherein the characteristics comprise a target descriptor and a lexical answer type identifying lexical traits to locate in the plurality of first fields of the input document;
for each first field:
evaluating, via one or more natural language processing techniques, semantic properties of the input descriptor against the plurality of mapping rules to determine whether the input descriptor is consistent with one of the target fields;
evaluating, via one or more natural language processing techniques, semantic properties of the text content against the plurality of mapping rules to determine whether the text content is consistent with one of the target fields, based on the lexical answer type associated with the target field, and wherein evaluating further comprises:
determining, for each mapping rule, a descriptor score associated with the input descriptor and a content score associated with the text content, the descriptor score and the content score indicating a likelihood that the respective input descriptor and text content match the characteristics specified in the mapping rule; and
converging the descriptor score and the content score into a consolidated score based on a weighting between the descriptor score and the content score specified by the associated mapping rule;
determining, based on evaluating the semantic properties of the input descriptor and the text content against the plurality of mapping rules, that the first field corresponds to a target field; and
upon determining that the first field corresponds to the target field, defining a mapping from the first field to the corresponding target field;
generating a normalized document by mapping the text content of each first field to the respective corresponding target field; and
sending the generated normalized document to an extract-transform-load (ETL) system.
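The scoring and mapping steps recited above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration only: the `MappingRule` class, the token-overlap descriptor score, the regular expression standing in for a lexical answer type, and the 0.5 acceptance threshold are simple stand-ins, not the natural language processing techniques the claim contemplates.

```python
import re
from dataclasses import dataclass


@dataclass
class MappingRule:
    """Hypothetical rule describing one target field."""
    target_descriptor: str    # name of the field in the target format
    synonyms: set             # assumed vocabulary for descriptor matching
    answer_type: str          # regex standing in for a lexical answer type
    descriptor_weight: float  # weighting specified by the rule, in [0, 1]


def descriptor_score(descriptor, rule):
    """Fraction of descriptor tokens found in the rule's vocabulary."""
    tokens = set(re.findall(r"[a-z]+", descriptor.lower()))
    vocab = rule.synonyms | {rule.target_descriptor}
    return len(tokens & vocab) / len(tokens) if tokens else 0.0


def content_score(text, rule):
    """1.0 if the text content has the expected lexical shape, else 0.0."""
    return 1.0 if re.fullmatch(rule.answer_type, text.strip()) else 0.0


def consolidated_score(descriptor, text, rule):
    """Combine both scores using the weighting carried by the rule."""
    w = rule.descriptor_weight
    return w * descriptor_score(descriptor, rule) + (1 - w) * content_score(text, rule)


def map_fields(fields, rules, threshold=0.5):
    """Map each input descriptor to its best-scoring target field, if any."""
    mapping = {}
    for descriptor, text in fields:
        best = max(rules, key=lambda r: consolidated_score(descriptor, text, r))
        if consolidated_score(descriptor, text, best) >= threshold:
            mapping[descriptor] = best.target_descriptor
    return mapping


def normalize(fields, mapping):
    """Build the normalized document from the mapped fields only."""
    return {mapping[d]: text for d, text in fields if d in mapping}


rules = [
    MappingRule("published", {"date", "posted", "pubdate"}, r"\d{4}-\d{2}-\d{2}", 0.4),
    MappingRule("author", {"by", "writer", "creator"}, r"[A-Z][a-z]+( [A-Z][a-z]+)+", 0.6),
]
fields = [("date_posted", "2014-03-01"), ("written_by", "Jane Doe"), ("footer", "n/a")]

mapping = map_fields(fields, rules)
normalized = normalize(fields, mapping)
print(mapping)
print(normalized)
```

In this sketch the `footer` field scores zero against both rules and is left unmapped, while `written_by` is mapped to `author` mainly on the strength of its content score. The normalized dictionary is what would then be handed to the ETL system.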
Abstract
Embodiments presented herein disclose techniques for transforming input documents having disparate formats into a normalized format (e.g., Atom, RSS, HTML, or customized XML). According to one embodiment, a plurality of fields is identified in an input document that has a given format. Each field includes a descriptor and text content associated with the descriptor. For each field, semantic properties of the descriptor and text content are evaluated against a plurality of mapping rules to determine whether the field is consistent with one of a plurality of fields of a target format. Each mapping rule specifies characteristics associated with one of the fields in the target format. Once a field is determined to be consistent with a target-format field, a mapping from the input field to that target-format field is defined.
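To make the idea of a normalized format concrete, the following Python sketch serializes two differently structured inputs into the same Atom-style entry once their field mappings are known. The input field names, the two mappings, and the `to_entry` helper are hypothetical, and the output is a simplified Atom-like fragment (no namespace or required metadata), not a complete Atom document.

```python
import xml.etree.ElementTree as ET

# Assumed subset of target fields, emitted in a fixed order.
TARGET_FIELDS = ("title", "updated")


def to_entry(fields, mapping):
    """Serialize mapped input fields as an Atom-style <entry> string."""
    # Re-key the text content by target field; unmapped fields are dropped.
    by_target = {mapping[d]: text for d, text in fields.items() if d in mapping}
    entry = ET.Element("entry")
    for target in TARGET_FIELDS:
        if target in by_target:
            ET.SubElement(entry, target).text = by_target[target]
    return ET.tostring(entry, encoding="unicode")


# Two inputs with disparate formats and their (already determined) mappings.
rss_like = {"headline": "Hello", "pubDate": "2014-03-01"}
mapping_a = {"headline": "title", "pubDate": "updated"}

custom = {"modified": "2014-03-01", "name": "Hello", "internal_id": "42"}
mapping_b = {"name": "title", "modified": "updated"}

print(to_entry(rss_like, mapping_a))
print(to_entry(custom, mapping_b))
```

Both calls yield the same entry, which is the point of normalizing before the documents reach a downstream system.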
44 Citations
4 Claims
Independent claim: 1. Dependent claims: 2, 3, 4.
Specification