Inferring a dataset schema from input files

US 10,540,333 B2
Filed: 12/05/2018
Issued: 01/21/2020
Est. Priority Date: 07/20/2017
Status: Active Grant


Abstract

Techniques for generating a schema for a data input file are described herein. In an embodiment, a server computer system receives a data input file. The server computer system selects a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file. The server computer system analyzes the sample excerpt to determine a row delimiter for the data input file, a column delimiter for the data input file, and a plurality of data format types. Using the column delimiter, row delimiter, and plurality of data format types, the server computer system generates a candidate schema for the data input file.
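
The sampling and row-delimiter steps summarized in the abstract can be illustrated with a minimal Python sketch. It follows one possible reading of the whitelist approach recited in claims 5 and 15 (pick the candidate whose first occurrence comes earliest in the excerpt). The whitelist contents, the 4096-character excerpt size, and the function names `select_sample_excerpt` and `detect_row_delimiter` are assumptions made for illustration and do not come from the patent.

```python
# Hypothetical whitelist of candidate row delimiters (not listed in the claims).
# "\r\n" comes first so that it wins a tie against a bare "\r" at the same position.
ROW_DELIMITER_WHITELIST = ("\r\n", "\n", "\r")


def select_sample_excerpt(data: str, max_chars: int = 4096) -> str:
    """Return a leading subset of the input file; the size is an assumption."""
    return data[:max_chars]


def detect_row_delimiter(excerpt: str, candidates=ROW_DELIMITER_WHITELIST):
    """Return the whitelisted candidate that occurs first in the excerpt, or None."""
    best = None
    best_pos = len(excerpt)
    for candidate in candidates:
        pos = excerpt.find(candidate)
        # Strict "<" keeps the earlier whitelist entry when two candidates
        # start at the same position.
        if pos != -1 and pos < best_pos:
            best, best_pos = candidate, pos
    return best


excerpt = select_sample_excerpt("id,name\r\n1,Ann\r\n2,Bob\r\n")
row_delimiter = detect_row_delimiter(excerpt)
```

Taking only a leading slice keeps the analysis cost independent of file size, at the risk of missing irregularities that appear later in the file.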


20 Claims

1. A method comprising:
- receiving a data input file;
  
  selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
  
  analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
  
  analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
  
  wherein analyzing the sample excerpt to determine the column delimiter for the data input file comprises:
  
  using the row delimiter, identifying a plurality of rows;
  
  identifying, in the plurality of rows, one or more candidate column delimiters;
  
  for each candidate column delimiter of the one or more candidate column delimiters:
  
  identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
  
  determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
  
  computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
  
  determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter as the column delimiter;
  
  using the column delimiter and the row delimiter to generate a candidate schema for the data input file;
  
  using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
  
  displaying the plurality of sample rows and sample columns through the graphical user interface;
  
  wherein the method is performed using one or more processors.
  - 2. The method of claim 1, further comprising:
    - analyzing the sample excerpt to determine header data for the data input file, the header data comprising one or more strings in the data input file;
      
      wherein analyzing the sample excerpt to determine the column delimiter, the row delimiter, and the plurality of data format types comprises analyzing only data in the sample excerpt that is not included in the header data.
  - 3. The method of claim 2, wherein analyzing the sample excerpt to determine header data for the data input file comprises:
    - determining that a first row in the sample excerpt does not contain a delimited numeric value;
      
      determining that a second row in the sample excerpt following the first row does contain a delimited value;
      
      based, at least in part, on determining that the first row does not contain a delimited value and the second row does contain a delimited value, determining that the first row consists of header data.
  - 4. The method of claim 2, further comprising using the header data, extracting one or more column names for the plurality of columns.
  - 5. The method of claim 1, wherein analyzing the sample excerpt to determine a row delimiter for the data input file comprises:
    - storing row delimiter whitelist data comprising a plurality of candidate row delimiters;
      
      searching the sample excerpt to locate a particular row delimiter candidate, wherein the particular candidate row delimiter is a first occurrence of any of the plurality of candidate row delimiters;
      
      selecting the particular row delimiter candidate as the row delimiter for the data input file.
  - 6. The method of claim 1, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising the one or more candidate column delimiters;
      
      identifying the one or more candidate column delimiters in the sample excerpt.
  - 7. The method of claim 1, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
      
      storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
      
      determining that the sample excerpt does not contain any of the plurality of particular candidate column delimiters;
      
      identifying, as the one or more candidate column delimiters, one or more symbols in the sample excerpt that are not contained in the column delimiter blacklist data.
  - 8. The method of claim 1, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
      
      storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
      
      identifying one or more particular candidate column delimiters in the sample excerpt;
      
      determining that a total deviation for the one or more particular column delimiters exceeds a stored deviation threshold and, in response, identifying, as at least one of the one or more candidate column delimiters, one or more symbols in the sample excerpt that are not contained in either the column delimiter whitelist data or the column delimiter blacklist data.
  - 9. The method of claim 1, wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:
    - identifying, in the sample excerpt, one or more symbols following an open quotation and preceding a close quotation;
      
      identifying a particular symbol immediately following the close quotation;
      
      selecting the particular symbol as the column delimiter.
  - 10. The method of claim 1, further comprising:
    - displaying with the plurality of sample rows and sample columns, data identifying the plurality of data format types, the row delimiter, and the column delimiter;
      
      receiving, through the graphical user interface, input modifying one or more of the column delimiter, the row delimiter, or one or more of the plurality of data format types;
      
      in response to the input, performing:
      
      analyzing the sample excerpt to determine a second column delimiter for the data input file;
      
      analyzing the sample excerpt to determine a second row delimiter for the data input file;
      
      analyzing the sample excerpt to determine a second plurality of data format types;
      
      using the second column delimiter, second row delimiter, and second plurality of data format types to generate a second candidate schema for the data input file;
      
      using the second candidate schema and the data input file, generating a second plurality of sample rows and sample columns;
      
      displaying the second plurality of sample rows and sample columns through the graphical user interface.
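
The column-delimiter selection recited in claim 1 (count each candidate per row, take the mode of those counts, sum the deviations from the mode, and pick the candidate with the lowest total) can be sketched as follows. This is an illustrative reading, not the patented implementation: the whitelist contents, the use of absolute differences as the "deviation", the filtering out of candidates that never appear, and the names `total_deviation` and `detect_column_delimiter` are all assumptions.

```python
from collections import Counter

# Hypothetical whitelist of candidate column delimiters (an assumption).
COLUMN_DELIMITER_WHITELIST = (",", "\t", ";", "|")


def total_deviation(rows, delimiter):
    """Sum of absolute deviations of per-row delimiter counts from their mode."""
    counts = [row.count(delimiter) for row in rows]
    mode = Counter(counts).most_common(1)[0][0]
    return sum(abs(count - mode) for count in counts)


def detect_column_delimiter(excerpt, row_delimiter="\n",
                            candidates=COLUMN_DELIMITER_WHITELIST):
    """Return the present candidate with the lowest total deviation, or None."""
    # Split with the already-determined row delimiter and drop empty rows.
    rows = [row for row in excerpt.split(row_delimiter) if row]
    # Only consider candidates that actually occur; an absent symbol would
    # otherwise score a perfect total deviation of zero.
    present = [c for c in candidates if any(c in row for row in rows)]
    if not present:
        return None
    # Ties go to the candidate listed first in the whitelist.
    return min(present, key=lambda c: total_deviation(rows, c))


sample = "id;name;note\n1;Ann;likes tea, coffee\n2;Bob;none\n3;Cy;a, b, c\n"
rows = [row for row in sample.split("\n") if row]
chosen = detect_column_delimiter(sample)
```

In this sample the semicolon appears exactly twice in every row, so its total deviation is 0, while the comma appears 0, 1, 0, and 2 times (mode 0, total deviation 3), so the semicolon is selected.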

11. A system comprising:
- one or more processors;
  
  one or more storage media;
  
  one or more instructions stored in the storage media which, when executed by the one or more processors, cause performance of:
  
  receiving a data input file;
  
  selecting a sample excerpt from the data input file, the sample excerpt comprising a subset of the data input file;
  
  analyzing the sample excerpt to determine a row delimiter for the data input file, the row delimiter comprising one or more symbols that delimit each particular row of a plurality of rows in the data input file;
  
  analyzing the sample excerpt to determine a column delimiter for the data input file, the column delimiter comprising one or more symbols that delimit each particular column of a plurality of columns in the data input file;
  
  wherein analyzing the sample excerpt to determine the column delimiter for the data input file comprises:
  
  using the row delimiter, identifying a plurality of rows;
  
  identifying, in the plurality of rows, one or more candidate column delimiters;
  
  for each candidate column delimiter of the one or more candidate column delimiters:
  
  identifying a number of instances of the candidate column delimiter in each of the plurality of rows;
  
  determining a mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows; and
  
  computing a total deviation for the candidate column delimiter, the total deviation comprising a sum of deviations of the number of instances of the candidate column delimiter in each of the plurality of rows from the mode of the numbers of instances of the candidate column delimiter in each of the plurality of rows;
  
  determining that a particular candidate column delimiter comprises a lowest total deviation of the candidate column delimiters and, in response, selecting the particular candidate column delimiter as the column delimiter;
  
  using the column delimiter and the row delimiter to generate a candidate schema for the data input file;
  
  using the candidate schema and the data input file, generating a plurality of sample rows and sample columns;
  
  displaying the plurality of sample rows and sample columns through the graphical user interface;
  
  wherein the method is performed using one or more processors.
  - 12. The system of claim 11, wherein the instructions, when executed by the one or more processors, further cause performance of:
    - analyzing the sample excerpt to determine header data for the data input file, the header data comprising one or more strings in the data input file;
      
      wherein analyzing the sample excerpt to determine the column delimiter, the row delimiter, and the plurality of data format types comprises analyzing only data in the sample excerpt that is not included in the header data.
  - 13. The system of claim 12, wherein analyzing the sample excerpt to determine header data for the data input file comprises:
    - determining that a first row in the sample excerpt does not contain a delimited numeric value;
      
      determining that a second row in the sample excerpt following the first row does contain a delimited value;
      
      based, at least in part, on determining that the first row does not contain a delimited value and the second row does contain a delimited value, determining that the first row consists of header data.
  - 14. The system of claim 12, wherein the instructions, when executed by the one or more processors, further cause performance of, using the header data, extracting one or more column names for the plurality of columns.
  - 15. The system of claim 11, wherein analyzing the sample excerpt to determine a row delimiter for the data input file comprises:
    - storing row delimiter whitelist data comprising a plurality of candidate row delimiters;
      
      searching the sample excerpt to locate a particular row delimiter candidate, wherein the particular candidate row delimiter is a first occurrence of any of the plurality of candidate row delimiters;
      
      selecting the particular row delimiter candidate as the row delimiter for the data input file.
  - 16. The system of claim 11, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising the one or more candidate column delimiters;
      
      identifying the one or more candidate column delimiters in the sample excerpt.
  - 17. The system of claim 11, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
      
      storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
      
      identifying one or more particular candidate column delimiters in the sample excerpt;
      
      determining that a total deviation for the one or more particular column delimiters exceeds a stored deviation threshold and, in response, identifying, as at least one of the one or more candidate column delimiters, one or more symbols in the sample excerpt that are not contained in either the column delimiter whitelist data or the column delimiter blacklist data.
  - 18. The system of claim 11, wherein identifying the one or more candidate column delimiters comprises:
    - storing column delimiter whitelist data comprising a plurality of particular candidate column delimiters;
      
      storing column delimiter blacklist data comprising data identifying one or more symbols that are not candidate column delimiters;
      
      identifying one or more particular candidate column delimiters in the sample excerpt;
      
      determining that a total deviation for the one or more particular column delimiters exceeds a stored deviation threshold and, in response, identifying, as at least one of the one or more candidate column delimiters, one or more symbols in the sample excerpt that are not contained in either the column delimiter whitelist data or the column delimiter blacklist data.
  - 19. The system of claim 11, wherein analyzing the sample excerpt to determine a column delimiter for the data input file comprises:
    - identifying, in the sample excerpt, one or more symbols following an open quotation and preceding a close quotation;
      
      identifying a particular symbol immediately following the close quotation;
      
      selecting the particular symbol as the column delimiter.
  - 20. The system of claim 11, wherein the instructions, when executed by the one or more processors, further cause performance of:
    - displaying with the plurality of sample rows and sample columns, data identifying the plurality of data format types, the row delimiter, and the column delimiter;
      
      receiving, through the graphical user interface, input modifying one or more of the column delimiter, the row delimiter, or one or more of the plurality of data format types;
      
      in response to the input, performing:
      
      analyzing the sample excerpt to determine a second column delimiter for the data input file;
      
      analyzing the sample excerpt to determine a second row delimiter for the data input file;
      
      analyzing the sample excerpt to determine a second plurality of data format types;
      
      using the second column delimiter, second row delimiter, and second plurality of data format types to generate a second candidate schema for the data input file;
      
      using the second candidate schema and the data input file, generating a second plurality of sample rows and sample columns;
      
      displaying the second plurality of sample rows and sample columns through the graphical user interface.
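
Two of the dependent-claim heuristics lend themselves to a short Python sketch: the header test of claims 3 and 13 (a first row with no delimited numeric value followed by a row that has one) and the quotation heuristic of claims 9 and 19 (the symbol immediately following a close quotation). Both functions are illustrative readings under stated assumptions: "numeric" is taken to mean anything `float()` accepts, double quotes are the only quotation mark considered, doubled-quote escapes and quoted fields spanning several rows are not handled, and the names `first_row_is_header` and `delimiter_after_quotes` are hypothetical.

```python
import re
from collections import Counter


def _has_numeric_field(row, column_delimiter):
    """True if any delimited field parses as a number (assumed meaning of 'numeric')."""
    for field in row.split(column_delimiter):
        try:
            float(field.strip())
            return True
        except ValueError:
            continue
    return False


def first_row_is_header(rows, column_delimiter):
    """Treat row 1 as header data if it has no numeric field but row 2 does."""
    if len(rows) < 2:
        return False
    return (not _has_numeric_field(rows[0], column_delimiter)
            and _has_numeric_field(rows[1], column_delimiter))


def delimiter_after_quotes(excerpt, row_delimiter="\n"):
    """Return the symbol that most often follows a close quotation, or None."""
    followers = Counter()
    # Scan row by row so that a quoted field ending one row is never paired
    # with the opening quotation of the next row.
    for row in excerpt.split(row_delimiter):
        for match in re.finditer(r'"[^"]*"(.?)', row):
            symbol = match.group(1)
            if symbol:  # empty when the quoted field ends the row
                followers[symbol] += 1
    return followers.most_common(1)[0][0] if followers else None


header_found = first_row_is_header(["name,score", "Ann,9.5"], ",")
quoted_delimiter = delimiter_after_quotes('"a";"b"\n"c";"d"\n')
```

Counting the most frequent follower, rather than taking the first one seen, is a design choice of this sketch; the claims recite only identifying a particular symbol immediately following the close quotation.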

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Ackner, Nir; Lin, Eric
Primary Examiner(s): Saeed, Usmaan
Assistant Examiner(s): Bartlett, William P
Application Number: US16/210,984
Publication Number: US 20190108244A1
Time in Patent Office: 412 Days
CPC Class Codes:
G06F 16/211   Schema design and management
G06F 3/0638   Organizing or formatting or...
G06F 40/205   Parsing
