Managing record format information
First Claim
1. A method for discovery of record formats of data records for processing in a data processing system, the method including:
receiving, by the data processing system from a data source, a data stream including plural distinct records that have a record format, with the records having fields that have data values; and
selecting by the data processing system a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, by:
accessing from storage the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats;
for each of two or more particular candidate record formats accessed, parsing data in each of multiple ones of the received, distinct records with a parser that applies, to the data, a data type for a field that is specified by the particular candidate record format;
for at least one of the two or more particular candidate record formats accessed, determining that the parser identifies one or more errors when attempting to parse data in at least one of the multiple ones of the received, distinct records; and
responsive to that determination, storing results data that specifies the data type or the field that was not parsed; and
for each of the two or more particular candidate record formats accessed, determining a measure of correspondence for the particular candidate record format based on an amount of data in each of the multiple ones of the received, distinct records that is successfully parsed by data types for those fields specified by the particular candidate record format, which measure of correspondence is based on an extent to which the particular candidate record format corresponds to the format of each of the multiple ones of the received, distinct records;
wherein the determined measure of correspondence for the at least one of the two or more particular candidate record formats accessed is further based on a number of one or more errors that the parser identifies for one or more data types of one or more corresponding fields specified by the at least one of the two or more particular candidate record formats, as specified by the stored results data for the at least one of the two or more particular candidate record formats; and
wherein the selected record format has a higher or equivalent measure of correspondence, relative to one or more other measures of correspondence for one or more other distinct candidate record formats.
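The selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes delimited text records, and the names `CandidateFormat`, `ParseResults`, `parse_with`, `measure_of_correspondence`, and `select_format` are hypothetical. The scoring formula (fields parsed minus errors, divided by fields attempted) is one assumed way to combine the amount of successfully parsed data with the number of parser errors; the claim does not prescribe a formula.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable


# Hypothetical data-type parsers: each raises ValueError when the text does not fit.
def parse_int(text):
    return int(text)


def parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d")


def parse_str(text):
    if not text:
        raise ValueError("empty string")
    return text


@dataclass
class CandidateFormat:
    name: str
    delimiter: str
    fields: list[tuple[str, Callable]]  # (field name, data-type parser)


@dataclass
class ParseResults:
    attempted: int = 0
    parsed: int = 0
    # Stored results data: (record index, field name, data type that failed).
    errors: list = field(default_factory=list)


def parse_with(fmt, records):
    """Apply every field's data type in fmt to every record and log failures."""
    results = ParseResults()
    for index, record in enumerate(records):
        values = record.split(fmt.delimiter)
        for position, (field_name, parser) in enumerate(fmt.fields):
            results.attempted += 1
            try:
                if position >= len(values):
                    raise ValueError("missing field")
                parser(values[position])
                results.parsed += 1
            except ValueError:
                results.errors.append((index, field_name, parser.__name__))
    return results


def measure_of_correspondence(results):
    """Assumed score: reward parsed fields, penalize each recorded error."""
    if results.attempted == 0:
        return 0.0
    return (results.parsed - len(results.errors)) / results.attempted


def select_format(candidates, records):
    """Return the candidate with the highest score; ties keep the earlier candidate."""
    scored = [(fmt, parse_with(fmt, records)) for fmt in candidates]
    return max(scored, key=lambda pair: measure_of_correspondence(pair[1]))
```

Calling `select_format(candidates, records)` returns the winning format together with its stored results, so the caller can inspect which fields and data types failed to parse.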
Abstract
Data is prepared for processing in a data processing system using format information. Data that includes records having values for fields is received over an input device or port. A target record format for processing the data is determined. Multiple records are analyzed according to validation tests to determine whether the data matches candidate record formats. Each candidate record format specifies a format for each field, and each validation test corresponds to at least one candidate record format. In response to receiving results of the validation tests, the target record format is associated with the data based on at least one of: a candidate record format for which at least a partial match was determined according to at least one validation test, a parsed record format selected according to a data type associated with the data, and a constructed record format generated from an analysis of data characteristics.
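The last option in the abstract, a constructed record format generated from an analysis of data characteristics, can be sketched as follows. This is only an illustration under stated assumptions: the records are delimited text, the delimiter is one of a small hypothetical list, and the three type names (`integer`, `decimal`, `string`) and the functions `guess_delimiter`, `guess_type`, and `construct_format` are invented here rather than taken from the patent.

```python
from collections import Counter

# Assumed set of delimiters to test; the source does not list any.
DELIMITERS = [",", "|", "\t", ";"]


def guess_delimiter(records):
    """Pick the delimiter whose non-zero per-record count is most consistent."""
    best, best_share = None, 0.0
    for delimiter in DELIMITERS:
        counts = Counter(record.count(delimiter) for record in records)
        count, frequency = counts.most_common(1)[0]
        share = frequency / len(records)
        if count > 0 and share > best_share:
            best, best_share = delimiter, share
    return best


def guess_type(values):
    """Classify a column as integer, decimal, or string from its values."""
    if all(value.lstrip("-").isdigit() for value in values):
        return "integer"
    try:
        for value in values:
            float(value)
        return "decimal"
    except ValueError:
        return "string"


def construct_format(records):
    """Build a record format description from characteristics of the data."""
    if not records:
        raise ValueError("at least one record is required")
    delimiter = guess_delimiter(records)
    if delimiter is None:
        # No consistent delimiter found: treat each record as one string field.
        return {"delimiter": None, "fields": [("field_0", "string")]}
    rows = [record.split(delimiter) for record in records]
    # Keep only rows with the most common field count before typing the columns.
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    rows = [row for row in rows if len(row) == width]
    fields = [(f"field_{i}", guess_type(column)) for i, column in enumerate(zip(*rows))]
    return {"delimiter": delimiter, "fields": fields}
```

A constructed format of this kind would only be a fallback when no stored candidate format matches well enough; the field names it produces are placeholders.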
17 Claims
16. A computer-readable storage medium storing a computer program for discovery of record formats of data records for processing in a data processing system, the computer program including instructions for causing a computer to:
receive, from a data source, a data stream including plural distinct records that have a record format, with the records having fields that have data values; and
select a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, by:
accessing from storage the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats;
for each of two or more particular candidate record formats accessed, parsing data in each of multiple ones of the received, distinct records with a parser that applies, to the data, a data type for a field that is specified by the particular candidate record format;
for at least one of the two or more particular candidate record formats accessed, determining that the parser identifies one or more errors when attempting to parse data in at least one of the multiple ones of the received, distinct records; and
responsive to that determination, storing results data that specifies the data type or the field that was not parsed; and
for each of the two or more particular candidate record formats accessed, determining a measure of correspondence for the particular candidate record format based on an amount of data in each of the multiple ones of the received, distinct records that is successfully parsed by data types for those fields specified by the particular candidate record format, which measure of correspondence is based on an extent to which the particular candidate record format corresponds to the format of each of the multiple ones of the received, distinct records;
wherein the determined measure of correspondence for the at least one of the two or more particular candidate record formats accessed is further based on a number of one or more errors that the parser identifies for one or more data types of one or more corresponding fields specified by the at least one of the two or more particular candidate record formats, as specified by the stored results data for the at least one of the two or more particular candidate record formats; and
wherein the selected record format has a higher or equivalent measure of correspondence, relative to one or more other measures of correspondence for one or more other distinct candidate record formats.
17. A computing system for discovery of record formats of data for processing in a data processing system, the computing system including:
an input port configured to receive from a data source a data stream that includes plural distinct records that each have a record format, with the records having fields that have data values; and
at least one processor configured to:
select a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, and further configured to:
access from a storage device the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats; and
for each of two or more particular candidate record formats accessed, parse data in each of multiple ones of the received, distinct records with a parser that applies, to the data, a data type for a field that is specified by the particular candidate record format;
for at least one of the two or more particular candidate record formats accessed, determine that the parser identifies one or more errors when attempting to parse data in at least one of the multiple ones of the received, distinct records; and
responsive to that determination, store results data that specifies the data type or the field that was not parsed; and
for each of the two or more particular candidate record formats accessed, determine a measure of correspondence for the particular candidate record format based on an amount of data in each of the multiple ones of the received, distinct records that is successfully parsed by data types for those fields specified by the particular candidate record format, which measure of correspondence is based on an extent to which the particular candidate record format corresponds to the format of each of the multiple ones of the received, distinct records;
wherein the determined measure of correspondence for the at least one of the two or more particular candidate record formats accessed is further based on a number of one or more errors that the parser identifies for one or more data types of one or more corresponding fields specified by the at least one of the two or more particular candidate record formats, as specified by the stored results data for the at least one of the two or more particular candidate record formats; and
wherein the selected record format has a higher or equivalent measure of correspondence, relative to one or more other measures of correspondence for one or more other distinct candidate record formats.
Specification