Inferring a dataset schema from input files

  • US 10,540,333 B2
  • Filed: 12/05/2018
  • Issued: 01/21/2020
  • Est. Priority Date: 07/20/2017
  • Status: Active Grant
First Claim

1. A method comprising:

  • receiving a data input file;

    selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;

    analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;

    analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;

    wherein analyzing the sample excerpt to determine the column delimiter for the data input file comprises:

    using the row delimiter, identifying a plurality of rows;

    identifying, in the plurality of rows, one or more candidate column delimiters;

    for each candidate column delimiter of the one or more candidate column delimiters:

    identifying a number of instances of the candidate column delimiter in each of the plurality of rows;

    determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and

    computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;

    determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter as the column delimiter;

    using the column delimiter and the row delimiter to generate a candidate schema for the data input file;

    using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;

    displaying the plurality of sample rows and sample columns through the graphical user interface;

    wherein the method is performed using one or more processors.
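
The column-delimiter selection recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the candidate set, the default newline row delimiter, the use of absolute differences as the "deviations", and the tie-break rule (prefer the candidate with the higher mode) are all assumptions made for illustration; the claim specifies none of them.

```python
from collections import Counter


def split_rows(excerpt, row_delim="\n"):
    """Split a sample excerpt into rows using an already-determined row delimiter.

    The newline default is an assumption; the claim does not say how the
    row delimiter is found.
    """
    return [row for row in excerpt.split(row_delim) if row]


def deviation_stats(rows, delim):
    """Return (total deviation, mode) of the per-row counts of `delim`.

    "Deviation" is taken here to be the absolute difference from the mode,
    which is an assumption about the claim's wording.
    """
    counts = [row.count(delim) for row in rows]
    mode = Counter(counts).most_common(1)[0][0]
    total = sum(abs(count - mode) for count in counts)
    return total, mode


def pick_column_delimiter(rows, candidates=(",", "\t", ";", "|")):
    """Pick the candidate with the lowest total deviation.

    Candidates that never occur in the rows are skipped. Ties are broken in
    favor of the higher mode (an assumption, not part of the claim).
    """
    scored = []
    for delim in candidates:
        total, mode = deviation_stats(rows, delim)
        if mode > 0 or total > 0:  # the candidate occurs in at least one row
            scored.append((total, -mode, delim))
    if not scored:
        return None
    return min(scored)[2]


if __name__ == "__main__":
    excerpt = "id,name;alias,score\n1,Ann;A,3.5\n2,Bob,4.0\n3,Cy;C,2.5\n"
    rows = split_rows(excerpt)
    print(repr(pick_column_delimiter(rows)))
```

In this hypothetical excerpt, every row contains two commas, so the comma's total deviation is 0. The semicolon appears once in three rows and not at all in the fourth, giving a mode of 1 and a total deviation of 1, so the comma is selected.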

  • 7 Assignments