Profiling data with source tracking
First Claim
1. A method for profiling data stored in a data storage system, the method including:
- accessing multiple collections of records stored in the data storage system over an interface coupled to the data storage system;
determining quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
processing the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and
aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
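The steps of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the dictionary layout, the function names `census`, `merge`, and `aggregate`, the `source` tag, and the use of record indices as location information are all assumptions made for illustration.

```python
from collections import defaultdict


def census(records, field):
    """Build value count entries for one field of one collection.

    Each entry holds a value, the number of records containing it,
    and the record indices where it appears (assumed location format).
    """
    locations = defaultdict(list)
    for index, record in enumerate(records):
        locations[record[field]].append(index)
    return [
        {"value": value, "count": len(where), "locations": where}
        for value, where in locations.items()
    ]


def merge(*tagged_lists):
    """Combine the entry lists of several collections into one list.

    The 'source' tag is a hypothetical way to remember the origin of each entry.
    """
    combined = []
    for source, entries in tagged_lists:
        for entry in entries:
            combined.append({**entry, "source": source})
    return combined


def aggregate(combined):
    """Collapse the combined list into one entry per distinct value."""
    distinct = {}
    for entry in combined:
        slot = distinct.setdefault(
            entry["value"], {"value": entry["value"], "total": 0, "sources": {}}
        )
        slot["total"] += entry["count"]
        slot["sources"][entry["source"]] = {
            "count": entry["count"],
            "locations": entry["locations"],
        }
    return list(distinct.values())


# Two small, made-up collections sharing a "city" field.
collection_a = [{"city": "Oslo"}, {"city": "Rome"}, {"city": "Oslo"}]
collection_b = [{"city": "Rome"}, {"city": "Lima"}]

combined = merge(
    ("A", census(collection_a, "city")),
    ("B", census(collection_b, "city")),
)
distinct_values = aggregate(combined)
```

Keeping the locations inside each entry means a distinct value in the summary can be traced back to the specific records, in each collection, that contain it.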
Abstract
Profiling data includes accessing multiple collections of records to store quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, each including a value appearing in the selected field and a count of the number of records in which the value appears. Processing the quantitative information of two or more collections includes: merging the value count entries of corresponding lists for at least one field from each of a first collection and a second collection to generate a combined list of value count entries, and aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries identifying a distinct value and including information quantifying a number of records in which the distinct value appears for each of the two or more collections.
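As a rough illustration of the per-collection counts the abstract describes, the following Python sketch builds one count list per collection and then reports, for every distinct value, how many records contain it in each collection. The function name `per_collection_counts`, the sample `status` field, and the output layout are assumptions, not details from the patent.

```python
from collections import Counter


def per_collection_counts(collections, field):
    """Map each distinct value to its record count in every collection."""
    # One value-to-count mapping per collection.
    censuses = [Counter(record[field] for record in records) for records in collections]
    # Union of all values seen in any collection, sorted for stable output.
    values = sorted(set().union(*censuses))
    # A Counter returns 0 for values that a collection does not contain.
    return {value: [census[value] for census in censuses] for value in values}


# Made-up sample data.
first = [{"status": "open"}, {"status": "closed"}, {"status": "open"}]
second = [{"status": "open"}, {"status": "pending"}]

table = per_collection_counts([first, second], "status")
```

For this sample data, `table["open"]` is `[2, 1]`: the value appears in two records of the first collection and one record of the second.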
123 Citations
38 Claims
1. A method for profiling data stored in a data storage system, the method including:
accessing multiple collections of records stored in the data storage system over an interface coupled to the data storage system;
determining quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
processing the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and
aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 2, 3, 4, 5, 6, 7

8. A computer program, stored on a computer-readable storage medium, for profiling data stored in a data storage system, the computer program including instructions for causing a computing system to:
access multiple collections of records stored in the data storage system over an interface coupled to the data storage system;
determine quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
process the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and
aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 9, 10, 11, 12, 13, 14

15. A computing system for profiling data stored in a data storage system, the computing system including:
an interface coupled to the data storage system configured to access multiple collections of records stored in the data storage system;
determining quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
at least one processor configured to process the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and
aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 16, 17, 18, 19, 20, 21

22. A computing system for profiling data stored in a data storage system, the computing system including:
means for accessing multiple collections of records stored in the data storage system;
means for determining quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
means for processing the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
merging the value count entries of corresponding lists for at least one field from each of at least a first collection and a second collection of the two or more collections to generate a combined list of value count entries, and
aggregating value count entries of the combined list of value count entries to generate a list of distinct field value entries, at least some of the distinct field value entries identifying a distinct value from at least one of the value count entries;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.

23. A method for profiling data stored in a data storage system, the method including:
accessing multiple collections of records stored in the data storage system over an interface coupled to the data storage system; and
processing quantitative information of two or more of the collections to generate profiling summary information, the processing including:
determining the quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field,
reading the value count entries of a corresponding list for at least one field from a first collection of the two or more collections to store output data that includes a list of distinct field value entries, and
reading the value count entries of a corresponding list for at least one field from a second collection of the two or more collections to store updated output data based at least in part on the stored output data so that at least some of the distinct field value entries identify a distinct value from value count entries of corresponding lists for the first and second collections;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 24, 25, 26, 27

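Claim 23 differs from claim 1 in that the lists are read one collection at a time and the stored output is updated on each pass, rather than merging the lists first. A minimal sketch of that incremental pattern follows, assuming an in-memory dictionary stands in for the stored output data; the function name `update_output`, the tuple layout of an entry, and the source labels are hypothetical.

```python
def update_output(output, entries, source):
    """Fold one collection's value count entries into the stored output.

    `entries` is a list of (value, count, locations) tuples (assumed layout).
    Entries for values already present are extended; new values are added.
    """
    for value, count, locations in entries:
        slot = output.setdefault(value, {})
        slot[source] = {"count": count, "locations": list(locations)}
    return output


# The dictionary plays the role of the stored output data (an assumption).
output = {}

# First pass: read the first collection's list and store the output.
update_output(output, [("Oslo", 2, [0, 2]), ("Rome", 1, [1])], "first")

# Second pass: read the second collection's list and update the stored output.
update_output(output, [("Rome", 1, [0]), ("Lima", 1, [1])], "second")
```

After the second pass, the entry for `"Rome"` carries counts and locations from both collections, while `"Oslo"` and `"Lima"` each carry information from one collection only.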
28. A computer program, stored on a computer-readable storage medium, for profiling data stored in a data storage system, the computer program including instructions for causing a computing system to:
access multiple collections of records stored in the data storage system over an interface coupled to the data storage system;
determine quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
process the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
reading the value count entries of a corresponding list for at least one field from a first collection of the two or more collections to store output data that includes a list of distinct field value entries, and
reading the value count entries of a corresponding list for at least one field from a second collection of the two or more collections to store updated output data based at least in part on the stored output data so that at least some of the distinct field value entries identify a distinct value from value count entries of corresponding lists for the first and second collections;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 29, 30, 31, 32

33. A computing system for profiling data stored in a data storage system, the computing system including:
an interface coupled to the data storage system configured to access multiple collections of records stored in the data storage system; and
at least one processor configured to process quantitative information of two or more of the collections to generate profiling summary information, the processing including:
determining the quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field,
reading the value count entries of a corresponding list for at least one field from a first collection of the two or more collections to store output data that includes a list of distinct field value entries, and
reading the value count entries of a corresponding list for at least one field from a second collection of the two or more collections to store updated output data based at least in part on the stored output data so that at least some of the distinct field value entries identify a distinct value from value count entries of corresponding lists for the first and second collections;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.
Dependent claims: 34, 35, 36, 37

38. A computing system for profiling data stored in a data storage system, the computing system including:
means for accessing multiple collections of records stored in the data storage system;
means for determining quantitative information for each of the multiple collections of records, the quantitative information for each particular collection including, for at least one selected field of the records in the particular collection, a corresponding list of value count entries, with each value count entry including a value appearing in at least the selected field and a count of the number of records in which the value appears in at least the selected field; and
means for processing the quantitative information of two or more of the collections to generate profiling summary information, the processing including:
reading the value count entries of a corresponding list for at least one field from a first collection of the two or more collections to store output data that includes a list of distinct field value entries, and
reading the value count entries of a corresponding list for at least one field from a second collection of the two or more collections to store updated output data based at least in part on the stored output data so that at least some of the distinct field value entries identify a distinct value from value count entries of corresponding lists for the first and second collections;
wherein each value count entry in a list of value count entries corresponding to a particular collection further includes location information identifying respective locations of records within the particular collection of records in which the value appears in at least the selected field.

Specification