Managing record format information

  • US 10,445,309 B2
  • Filed: 11/12/2010
  • Issued: 10/15/2019
  • Est. Priority Date: 11/13/2009
  • Status: Active Grant
First Claim

1. A method for discovery of record formats of data records for processing in a data processing system, the method including:

  • receiving, by the data processing system from a data source, a data stream including plural distinct records that have a record format, with the records having fields that have data values; and

    selecting by the data processing system a record format that corresponds to the format of the data source, with the record format being one of a plurality of distinct candidate record formats, by:

    accessing from storage the distinct candidate record formats, with each particular one of the distinct candidate record formats specifying a data type for each field of a group of one or more fields of that particular one of the distinct candidate record formats;

    for each of two or more particular candidate record formats accessed, parsing data in each of multiple ones of the received, distinct records with a parser that applies, to the data, a data type for a field that is specified by the particular candidate record format;

    for at least one of the two or more particular candidate record formats accessed, determining that the parser identifies one or more errors when attempting to parse data in at least one of the multiple ones of the received, distinct records; and

    responsive to that determination, storing results data that specifies the data type or the field that was not parsed; and

    for each of the two or more particular candidate record formats accessed, determining a measure of correspondence for the particular candidate record format based on an amount of data in each of the multiple ones of the received, distinct records that is successfully parsed by data types for those fields specified by the particular candidate record format, which measure of correspondence is based on an extent to which the particular candidate record format corresponds to the format of each of the multiple ones of the received, distinct records;

    wherein the determined measure of correspondence for the at least one of the two or more particular candidate record formats accessed is further based on a number of one or more errors that the parser identifies for one or more data types of one or more corresponding fields specified by the at least one of the two or more particular candidate record formats, as specified by the stored results data for the at least one of the two or more particular candidate record formats; and

    wherein the selected record format has a higher or equivalent measure of correspondence, relative to one or more other measures of correspondence for one or more other distinct candidate record formats.
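
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the comma-delimited record layout, the `PARSERS` table, the `Result` class, the function names, and the scoring rule (characters successfully parsed minus a penalty per parse error) are all assumptions chosen for illustration. Each candidate format is tried against every record, failures are stored together with the field and data type that did not parse, and the candidate with the highest measure of correspondence is selected.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Tuple

# Hypothetical per-type parsers; each raises ValueError on data it cannot parse.
PARSERS: Dict[str, Callable[[str], object]] = {
    "int": int,
    "float": float,
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}

# A candidate record format: an ordered list of (field name, data type) pairs.
RecordFormat = List[Tuple[str, str]]


@dataclass
class Result:
    """Stored results data for one candidate format."""
    parsed_chars: int = 0  # amount of data successfully parsed
    errors: List[Tuple[str, str]] = field(default_factory=list)  # (field, type)


def try_format(records: List[str], fmt: RecordFormat) -> Result:
    """Parse every record with one candidate format, recording each failure."""
    result = Result()
    for record in records:
        values = record.split(",")  # assumption: comma-delimited fields
        if len(values) != len(fmt):
            result.errors.append(("<field count>", "n/a"))
        for (name, dtype), value in zip(fmt, values):
            try:
                PARSERS[dtype](value)
                result.parsed_chars += len(value)
            except ValueError:
                result.errors.append((name, dtype))
    return result


def measure(result: Result, error_penalty: float = 1.0) -> float:
    """Assumed scoring rule: parsed characters minus a penalty per error."""
    return result.parsed_chars - error_penalty * len(result.errors)


def select_format(records: List[str], candidates: Dict[str, RecordFormat]):
    """Return the best-scoring candidate name and the per-candidate results."""
    results = {name: try_format(records, fmt) for name, fmt in candidates.items()}
    best = max(results, key=lambda name: measure(results[name]))
    return best, results


records = ["1,2019-10-15,3.5", "2,2010-11-12,4.25"]
candidates = {
    "id_day_amount": [("id", "int"), ("day", "date"), ("amount", "float")],
    "id_amount_day": [("id", "int"), ("amount", "float"), ("day", "date")],
}
best, results = select_format(records, candidates)
print(best)
```

In this example the second candidate fails on the `amount` and `day` fields of both records, so its stored results list four errors and its score falls below that of the first candidate. When two candidates tie, `max` returns the first one encountered, which is one possible reading of the claim's "higher or equivalent" wording.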
