Inferring a dataset schema from input files

  • US 10,204,119 B1
  • Filed: 07/20/2017
  • Issued: 02/12/2019
  • Est. Priority Date: 07/20/2017
  • Status: Active Grant
First Claim

1. A method comprising:

    receiving a data input file;

    selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;

    analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;

    analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;

    wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:

    using the row delimiter, identifying a plurality of rows;

    identifying, in the plurality of rows, one or more candidate column delimiters;

    for each candidate column delimiter of the one or more candidate column delimiters:

    identifying a number of instances of the candidate column delimiter in each of the plurality of rows;

    determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and

    computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;

    determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter;

    analyzing the sample excerpt to determine a plurality of data format types, each particular data format type corresponding to a particular column of the plurality of columns in the data input file;

    wherein analyzing the sample excerpt to determine a plurality of data format types comprises:

    using the row delimiter, identifying a plurality of rows;

    using the column delimiter, identifying a plurality of columns;

    for each column of the plurality of columns performing:

    parsing data in the plurality of rows using a plurality of data formats;

    determining that data in one or more rows of the plurality of rows cannot be parsed with one or more first data formats of the plurality of data formats;

    identifying one or more candidate data formats for the column excluding the one or more first data formats; and

    selecting a second data format from the one or more candidate data formats;

    using the column delimiter, row delimiter, and plurality of data format types to generate a candidate schema for the data input file;

    using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;

    displaying the plurality of sample rows and sample columns through a graphical user interface;

    wherein the method is performed using one or more processors.
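
The delimiter-selection and format-selection steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything the claim leaves open is an assumption made here for illustration: the candidate delimiter list, the use of absolute difference as the "deviation", the arbitrary tie-breaking, the four example data formats and their preference order, and all function names. The row delimiter is taken as a parameter because the claim does not say how it is determined.

```python
from collections import Counter
from datetime import datetime


def total_deviation(rows, delimiter):
    """Sum over rows of |count - mode|, where count is the number of times
    the delimiter occurs in a row. Absolute difference is an assumption."""
    counts = [row.count(delimiter) for row in rows]
    mode = Counter(counts).most_common(1)[0][0]
    return sum(abs(c - mode) for c in counts)


def pick_column_delimiter(rows, candidates=(",", "\t", ";", "|")):
    """Return the candidate with the lowest total deviation.
    The candidate list is hypothetical; ties go to the earliest candidate."""
    present = [d for d in candidates if any(d in row for row in rows)]
    if not present:
        raise ValueError("no candidate column delimiter found in the sample")
    return min(present, key=lambda d: total_deviation(rows, d))


def _parse_date(text):
    return datetime.strptime(text, "%Y-%m-%d")


# Hypothetical data formats, ordered from most to least specific.
PARSERS = [
    ("integer", int),
    ("double", float),
    ("date", _parse_date),
    ("string", str),
]


def infer_column_type(values):
    """Exclude every format that fails to parse some value, then select the
    most specific remaining candidate (this preference is an assumption)."""
    candidates = []
    for name, parse in PARSERS:
        try:
            for value in values:
                parse(value)
        except ValueError:
            continue  # this format cannot parse at least one row
        candidates.append(name)
    return candidates[0]  # "string" always parses, so the list is non-empty


def infer_schema(sample, row_delimiter="\n"):
    """Build a candidate schema from a sample excerpt."""
    rows = [r for r in sample.split(row_delimiter) if r]
    delimiter = pick_column_delimiter(rows)
    table = [r.split(delimiter) for r in rows]
    # Keep only rows with the most common field count so columns line up.
    width = Counter(len(r) for r in table).most_common(1)[0][0]
    table = [r for r in table if len(r) == width]
    types = [infer_column_type(column) for column in zip(*table)]
    return {
        "row_delimiter": row_delimiter,
        "column_delimiter": delimiter,
        "types": types,
    }


sample = "1,2.5,2017-07-20,a;b\n2,3.0,2017-07-21,c\n3,4,2017-07-22,d\n"
schema = infer_schema(sample)
```

In this sample the comma occurs three times in every row, so its total deviation is 0, while the semicolon occurs in only one row and has a total deviation of 1. The comma is therefore selected, and the four columns are typed as integer, double, date, and string.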
