Inferring a dataset schema from input files
First Claim
1. A method comprising:
receiving a data input file;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:
using the row delimiter, identifying a plurality of rows;
identifying, in the plurality of rows, one or more candidate column delimiters;
for each candidate column delimiter of the one or more candidate column delimiters:
identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter;
analyzing the sample excerpt to determine a plurality of data format types, each particular data format type corresponding to a particular column of each particular column of the plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine a plurality of data format types comprises:
using the row delimiter, identifying a plurality of rows;
using the column delimiter, identifying a plurality of columns;
for each column of the plurality of columns, performing:
parsing data in the plurality of rows using a plurality of data formats;
determining that data in one or more rows of the plurality of rows cannot be parsed with one or more first data formats of the plurality of data formats;
identifying one or more candidate data formats for the column excluding the one or more first data formats; and
selecting a second data format from the one or more candidate data formats;
using the column delimiter, row delimiter, and plurality of data format types to generate a candidate schema for the data input file;
using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
displaying the plurality of sample rows and sample columns through a graphical user interface;
wherein the method is performed using one or more processors.
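The column delimiter selection recited above (count instances per row, take the mode, sum the deviations from the mode, pick the lowest total) can be illustrated with a minimal Python sketch. The function names, the default candidate list, and the use of absolute difference as the "deviation" are assumptions made for illustration; the claim does not specify them.

```python
from collections import Counter


def total_deviation(rows, delimiter):
    """Sum over rows of |count - mode|, where count is the number of
    instances of `delimiter` in a row and mode is the most common count.
    Absolute difference is an assumed reading of "deviation"."""
    counts = [row.count(delimiter) for row in rows]
    mode = Counter(counts).most_common(1)[0][0]
    return sum(abs(count - mode) for count in counts)


def pick_column_delimiter(sample, row_delimiter="\n",
                          candidates=(",", ";", "\t", "|")):
    """Return the candidate with the lowest total deviation.
    The candidate list is hypothetical; ties go to the earlier candidate."""
    rows = [row for row in sample.split(row_delimiter) if row]
    # Assumption: only symbols that actually occur in the sample are candidates.
    present = [d for d in candidates if any(d in row for row in rows)]
    if not present:
        return None
    return min(present, key=lambda d: total_deviation(rows, d))


sample = "a,b,c\n1,2,3\n4,5;x,6\n"
print(pick_column_delimiter(sample))  # prints ","
```

In this sample the comma appears twice in every row, so its total deviation is 0, while the semicolon appears in only one row and has a total deviation of 1. A symbol that occurs in very few rows can also score a low deviation, so a practical version would likely need an additional tie-breaking or filtering rule that is not shown here.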
Abstract
Techniques for generating a schema for a data input file are described herein. In an embodiment, a server computer system receives a data input file. The server computer system selects a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file. The server computer system analyzes the sample excerpt to determine a row delimiter for the data input file, a column delimiter for the data input file, and a plurality of data format types. Using the column delimiter, row delimiter, and plurality of data format types, the server computer system generates a candidate schema for the data input file.
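As an illustration of how the three inferred pieces could be combined, the following Python sketch bundles a row delimiter, a column delimiter, and one parser per column into a candidate schema and applies it to produce sample rows. The `CandidateSchema` class, the `sample_rows` function, and the `limit` parameter are hypothetical names introduced here, not structures described in the source, and the graphical display step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateSchema:
    """Hypothetical container for the inferred delimiters and column formats."""
    row_delimiter: str
    column_delimiter: str
    column_parsers: List[Callable[[str], object]]  # one parser per column


def sample_rows(schema, text, limit=5):
    """Split `text` with the schema's delimiters and parse up to `limit` rows."""
    rows = [row for row in text.split(schema.row_delimiter) if row][:limit]
    parsed = []
    for row in rows:
        cells = row.split(schema.column_delimiter)
        # zip stops at the shorter of parsers and cells (a simplification).
        parsed.append([parse(cell)
                       for parse, cell in zip(schema.column_parsers, cells)])
    return parsed


schema = CandidateSchema("\n", ",", [int, str, float])
print(sample_rows(schema, "1,a,2.5\n2,b,3.0\n"))
# prints [[1, 'a', 2.5], [2, 'b', 3.0]]
```

The parsed rows are what a user interface could then display for review before the candidate schema is accepted.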
16 Claims
1. A method comprising the steps recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
12. A system comprising:
one or more processors;
one or more storage media;
one or more instructions stored in the storage media which, when executed by the one or more processors, cause performance of:
receiving a data input file;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:
using the row delimiter, identifying a plurality of rows;
identifying, in the plurality of rows, one or more candidate column delimiters;
for each candidate column delimiter of the one or more candidate column delimiters:
identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter;
analyzing the sample excerpt to determine a plurality of data format types, each particular data format type corresponding to a particular column of each particular column of the plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine a plurality of data format types comprises:
using the row delimiter, identifying a plurality of rows;
using the column delimiter, identifying a plurality of columns;
for each column of the plurality of columns, performing:
parsing data in the plurality of rows using a plurality of data formats;
determining that data in one or more rows of the plurality of rows cannot be parsed with one or more first data formats of the plurality of data formats;
identifying one or more candidate data formats for the column excluding the one or more first data formats; and
selecting a second data format from the one or more candidate data formats;
using the column delimiter, row delimiter, and plurality of data format types to generate a candidate schema for the data input file;
using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
displaying the plurality of sample rows and sample columns through a graphical user interface.
Dependent claims: 13, 14, 15, 16.
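The per-column format determination recited in the claims (try several data formats, exclude those that fail on any row, select one of the remaining candidates) can be illustrated with a short Python sketch. The format list, its ordering from most specific to most general, the ISO-style date pattern, and the rule of selecting the first surviving candidate are all assumptions made for illustration; the claims do not say which formats are tried or how the final one is chosen.

```python
from datetime import datetime


def _parse_date(value):
    # Assumed date layout for this sketch only.
    return datetime.strptime(value, "%Y-%m-%d")


# Hypothetical formats, ordered from most specific to most general.
FORMATS = [("integer", int), ("float", float),
           ("date", _parse_date), ("string", str)]


def infer_column_format(values):
    """Exclude every format that fails on at least one value, then
    select the first remaining candidate (an assumed selection rule)."""
    candidates = []
    for name, parser in FORMATS:
        try:
            for value in values:
                parser(value)
        except ValueError:
            continue  # this format cannot parse some row; exclude it
        candidates.append(name)
    return candidates[0]  # "string" always survives, so this is non-empty


def infer_formats(rows, column_delimiter):
    """Infer one format per column; zip truncates to the shortest row."""
    columns = zip(*(row.split(column_delimiter) for row in rows))
    return [infer_column_format(column) for column in columns]


rows = ["1,2.5,2020-01-02,x", "2,3,2021-05-06,y"]
print(infer_formats(rows, ","))
# prints ['integer', 'float', 'date', 'string']
```

Ordering the formats by specificity means a column of whole numbers is reported as integer rather than float or string, even though all three can parse it.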
Specification