SYSTEM FOR INFERRING DATA STRUCTURES
Abstract
A system is disclosed for formulating structure descriptions from data. In some embodiments, data arrives with an unknown format. The data may be ad hoc data that is considered semi-structured. Disclosed embodiments analyze chunks of the data to determine tokens. Tokens are analyzed to identify base types and compound types such as structs, unions, and arrays. Descriptions are generated and undergo scoring and rewriting for optimization. The generated descriptions may be fed to a data description language such as Processing Ad Hoc Data System (PADS) and compiled for processing the raw data. In some embodiments, the raw data is parsed, printed, or reformatted using the generated descriptions.
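The token-determination step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the base types (`ip`, `float`, `int`, `word`, `punct`), their regular expressions, and the function name `tokenize` are all assumptions chosen for illustration, and each chunk is assumed to be one line of text.

```python
import re

# Hypothetical base types, tried in order; the source does not enumerate them.
TOKEN_PATTERNS = [
    ("ip", r"\d{1,3}(?:\.\d{1,3}){3}"),
    ("float", r"-?\d+\.\d+"),
    ("int", r"-?\d+"),
    ("word", r"[A-Za-z_]\w*"),
    ("white", r"\s+"),
    ("punct", r"."),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(chunk):
    """Return (base_type, text) pairs for one chunk, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in LEXER.finditer(chunk)
        if m.lastgroup != "white"
    ]


print(tokenize("10.0.0.1 GET 200 0.35"))
```

Ordering the patterns from most to least specific keeps an address such as `10.0.0.1` from being split into separate numbers; a real system would need a richer and likely configurable set of base types.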
151 Citations
24 Claims
1. An application server enabled to:
analyze data to infer a structure of the data; and
generate a format specification based on the inferred structure of the data, wherein the format specification complies with a data description language.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10.
11. A method of determining data structures, the method comprising:
receiving raw data arranged into a plurality of chunks that include a plurality of fields that are separated by a plurality of instances of a delimiter;
determining a quantity of the plurality of instances of the delimiter for a portion of the plurality of chunks;
determining whether there are corresponding fields in the portion of the plurality of chunks;
if there are corresponding fields in the portion of the plurality of chunks, determining whether a threshold number of entries have the same data class, wherein the threshold number of entries are from individual fields of a portion of the corresponding fields; and
creating a description having a plurality of description entries, wherein a description entry specifies a data class for individual of the corresponding data fields.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
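The steps recited in claim 11 can be sketched in Python as follows. This is an illustrative reading only, not the claimed method itself: the names `classify` and `describe`, the three data classes, the 0.9 threshold, the rule that fields correspond only when every chunk has the same delimiter count, and the fallback to `"string"` below the threshold are all assumptions.

```python
from collections import Counter


def classify(entry):
    """Assign a coarse, hypothetical data class to one field entry."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(entry)
            return name
        except ValueError:
            pass
    return "string"


def describe(chunks, delimiter=",", threshold=0.9):
    """Return one data class per field, or None if the fields do not line up."""
    # Step 1: count the delimiter instances in each chunk.
    counts = [chunk.count(delimiter) for chunk in chunks]
    # Step 2 (assumption): fields correspond only if every chunk has the same count.
    if not chunks or len(set(counts)) != 1:
        return None
    rows = [chunk.split(delimiter) for chunk in chunks]
    description = []
    for column in zip(*rows):
        # Step 3: find the most common data class among this field's entries.
        classes = Counter(classify(entry.strip()) for entry in column)
        data_class, hits = classes.most_common(1)[0]
        # Step 4: keep it only if a threshold share of entries agree;
        # falling back to "string" is an assumption of this sketch.
        description.append(data_class if hits >= threshold * len(column) else "string")
    return description


print(describe(["1,alice,3.5", "2,bob,4.0", "3,carol,7.25"]))
```

Expressing the threshold as a fraction rather than an absolute count lets the same setting apply to samples of different sizes; the claim language itself leaves the form of the threshold open.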
22. A computer program product stored on a computer readable media, the computer program product for determining data formats, the computer program product having instructions operable for:
lexing raw data to result in a plurality of tokens, wherein the raw data is understood to be arranged in a plurality of chunks, wherein each chunk contains a plurality of corresponding fields;
on a field-by-field basis for a portion of the plurality of chunks, summing the occurrence counts of a token to result in a plurality of sums;
determining from the plurality of sums whether each of the portion of the plurality of chunks has a threshold number of occurrence counts;
if each portion of the plurality of chunks has a first threshold number of occurrence counts, including in a description a first indication that the token is a delimiter;
determining for the portion of the plurality of chunks whether a second threshold number of occurrence counts occur within corresponding entries of a field, wherein each corresponding entry is from a different chunk of the portion of the plurality of chunks, and
if a second threshold number of occurrence counts of a data class occur within corresponding entries of the field, including in the description a second indication that entries in the field are from the data class.
Dependent claims: 23, 24.
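The delimiter test in the first half of claim 22 can be sketched in Python as below. The sketch is a simplification under stated assumptions: tokens are single characters drawn from a hypothetical `candidates` set, occurrence counts are tallied per chunk, and the first threshold is modeled as a minimum count `min_count` that every chunk must reach. The function name `infer_delimiters` is likewise illustrative.

```python
from collections import Counter


def infer_delimiters(chunks, candidates=",;|\t:", min_count=1):
    """Return candidate tokens that occur at least min_count times in every chunk."""
    # Lex each chunk into single-character candidate tokens and count them.
    per_chunk = [Counter(ch for ch in chunk if ch in candidates) for chunk in chunks]
    delimiters = []
    for token in candidates:
        occurrences = [counts[token] for counts in per_chunk]
        # A token is flagged as a delimiter only if every chunk meets the threshold.
        if occurrences and all(n >= min_count for n in occurrences):
            delimiters.append(token)
    return delimiters


print(infer_delimiters(["a|b|c", "d|e:f|g", "h|i|j"]))
```

A token that appears in only some chunks, such as the colon in the second sample chunk, is treated as field content rather than as a delimiter. The second half of the claim, which checks data classes within corresponding entries of a field, would then run on the fields obtained by splitting on the inferred delimiter.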
Specification