Inferring a dataset schema from input files
First Claim
1. A method comprising:
receiving a data input file;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine the column delimiter for the data input file comprises:
using the row delimiter, identifying a plurality of rows;
identifying, in the plurality of rows, one or more candidate column delimiters;
for each candidate column delimiter of the one or more candidate column delimiters:
identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter as the column delimiter;
using the column delimiter and the row delimiter to generate a candidate schema for the data input file;
using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
displaying the plurality of sample rows and sample columns through the graphical user interface;
wherein the method is performed using one or more processors.
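The column-delimiter selection recited above can be illustrated with a short Python sketch. It is not the patented implementation: the function names, the candidate set (comma, tab, semicolon, pipe), the use of absolute differences as the "deviations", and the first-candidate-wins tie-break are all assumptions made for illustration.

```python
from collections import Counter


def total_deviation(rows, delimiter):
    """Sum of |count - mode| over all rows, where count is the number of
    times the delimiter occurs in a row (absolute deviation is an assumption)."""
    counts = [row.count(delimiter) for row in rows]
    mode = Counter(counts).most_common(1)[0][0]
    return sum(abs(c - mode) for c in counts)


def pick_column_delimiter(sample, row_delimiter="\n",
                          candidates=(",", "\t", ";", "|")):
    """Return the candidate whose per-row counts deviate least from their mode."""
    # Split the sample excerpt into rows, ignoring empty rows.
    rows = [r for r in sample.split(row_delimiter) if r]
    # Only candidates that actually occur in the rows are considered.
    present = [d for d in candidates if any(d in r for r in rows)]
    if not present:
        return None
    # min() keeps the first candidate on a tie (an assumed tie-break).
    return min(present, key=lambda d: total_deviation(rows, d))


sample = "id,name,note\n1,Ann,likes a;b\n2,Bob,none\n3,Cy,x|y\n"
print(pick_column_delimiter(sample))
```

In this sample the comma occurs exactly twice in every row, so its total deviation is 0, while the semicolon and pipe each occur in a single row and deviate from their mode of 0 by 1.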
Abstract
Techniques for generating a schema for a data input file are described herein. In an embodiment, a server computer system receives a data input file. The server computer system selects a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file. The server computer system analyzes the sample excerpt to determine a row delimiter for the data input file, a column delimiter for the data input file, and a plurality of data format types. Using the column delimiter, row delimiter, and plurality of data format types, the server computer system generates a candidate schema for the data input file.
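As a rough illustration of how delimiters and data format types might combine into a candidate schema, the Python sketch below assigns each column the narrowest type that fits all of its sampled values. The abstract does not say how types are determined, so the type names, the `%Y-%m-%d` date format, the header-row assumption, and both function names are hypothetical.

```python
from datetime import datetime


def infer_type(values):
    """Return the first of 'integer', 'float', 'date', 'string' that fits
    every non-empty value (the type list itself is an assumption)."""
    def all_parse(parse):
        for v in values:
            if v == "":
                continue  # empty cells do not constrain the type
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


def candidate_schema(sample, row_delimiter, column_delimiter):
    """Build a list of {name, type} entries, assuming the first row is a header."""
    rows = [r.split(column_delimiter)
            for r in sample.split(row_delimiter) if r]
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))  # transpose rows into columns
    return [{"name": name, "type": infer_type(col)}
            for name, col in zip(header, columns)]


sample = "id,price,date,name\n1,2.5,2020-01-02,Ann\n2,3,2021-03-04,Bob\n"
schema = candidate_schema(sample, "\n", ",")
```

A real system would likely support more formats and tolerate a few non-conforming values; this sketch requires every sampled value to parse.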
20 Claims
1. A method as recited in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
11. A system comprising:
one or more processors;
one or more storage media;
one or more instructions stored in the storage media which, when executed by the one or more processors, cause performance of:
receiving a data input file;
selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
wherein analyzing the sample excerpt to determine the column delimiter for the data input file comprises:
using the row delimiter, identifying a plurality of rows;
identifying, in the plurality of rows, one or more candidate column delimiters;
for each candidate column delimiter of the one or more candidate column delimiters:
identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter as the column delimiter;
using the column delimiter and the row delimiter to generate a candidate schema for the data input file;
using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
displaying the plurality of sample rows and sample columns through the graphical user interface;
wherein the method is performed using one or more processors.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20.
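The receiving, sampling, row-delimiter, and preview steps recited in this claim could be sketched in Python as follows. The claim does not state how the sample is chosen or how the row delimiter is determined, so reading a fixed-size prefix, counting the three common line endings, and all function names and default values here are assumptions for illustration only.

```python
import io


def select_sample(stream, max_chars=4096):
    """Read at most max_chars characters from the start of the input
    as the sample excerpt (prefix sampling is an assumption)."""
    return stream.read(max_chars)


def detect_row_delimiter(sample):
    """Pick the most frequent of CRLF, LF, CR; fall back to LF if none occur."""
    crlf = sample.count("\r\n")
    counts = {
        "\r\n": crlf,
        "\n": sample.count("\n") - crlf,  # bare LF only
        "\r": sample.count("\r") - crlf,  # bare CR only
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "\n"


def preview_rows(sample, row_delimiter, column_delimiter, limit=5):
    """Split the sample into at most `limit` rows of column values for display."""
    rows = [r for r in sample.split(row_delimiter) if r]
    return [r.split(column_delimiter) for r in rows[:limit]]


# newline="" keeps the original line endings untranslated.
stream = io.StringIO("a,b\r\n1,2\r\n3,4\r\n", newline="")
sample = select_sample(stream, max_chars=100)
row_delim = detect_row_delimiter(sample)
preview = preview_rows(sample, row_delim, ",", limit=2)
```

A fixed-size prefix may cut the final row short, so a fuller version would probably discard a trailing partial row before analysis.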
Specification