Input gathering system and method for defining, refining or validating star schema for a source database
First Claim
1. One or more non-transitory computer readable storage mediums storing one or more sequences of instructions, which when executed by one or more processors, causes automatic gathering of inputs for defining, refining or validating star schema for a source database, by performing the steps of:
- linking the star schema to the source database through a communication network;
automatically analyzing queries on the source database, for defining, refining or validating the star schema, based on a usage analysis technique to gather any of (i) entities and columns, (ii) entity keys, (iii) relationships between and within entities, (iv) measures, (v) workflow attributes, or (vi) an update frequency associated with the entities and the columns of the source database;
automatically analyzing data in the source database, for defining, refining or validating the star schema, based on a data profiling technique to gather any of (i) entity keys, (ii) relationships between and within entities, (iii) measures, (iv) workflow attributes, and (v) specialized entities;
automatically determining a first ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database based on any of (a) the usage analysis technique, and (b) the data profiling technique;
automatically determining a second ranking for any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database by automatically gathering a ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys that are determined based on any of (a) the usage analysis technique, and (b) the data profiling technique; and
compiling ranked inputs of the usage analysis and data profiling to define, refine, or validate the star schema for the source database,wherein the one or more sequences of instructions, which when executed by the one or more processors further causes (a) automatic determination of measures comprising quantifiable metrics by analyzing queries on the source database for expressions on numeric attributes in a projection list, and (b) automatic ranking of expressions in an ordered list of candidate measures for expressions that occur across distinct queries,wherein the automatic gathering of inputs for defining, refining or validating the star schema for the source database increases an analytics efficiency to identify changes in a source warehouse schema containing data gathered from a plurality of databases in order to output a computer-generated dimensional model containing the star schema, andwherein all automatic processes are performed using a computer structured query programming method.
4 Assignments
0 Petitions
Accused Products
Abstract
An automated input gathering system for defining, refining or validating a data warehouse star schema for a given data source. The automated data input gathering system includes a source metadata analysis module, an operational reporting environment analysis module, a usage analysis module and a data profiling module. To automatically ascertain inputs for warehouse definition, refinement or validation like i) list of entities and most relevant ones ii) entity column types and lengths iii) entity keys iv) relationships between entities v) relationships within entities vi) entity and column update frequencies vii) grouping of entity and column updates viii) specialized entities ix) workflow and correlated attributes and x) commonly used measures the source metadata analysis module programmatically analyzes source metadata documentation or API'"'"'s; the operational reporting environment analysis module analyzes underlying models, reports and dashboards; the usage analysis module analyzes queries on the source database; and the data profiling module analyzes data in the source database.
39 Citations
21 Claims
-
1. One or more non-transitory computer readable storage mediums storing one or more sequences of instructions, which when executed by one or more processors, causes automatic gathering of inputs for defining, refining or validating star schema for a source database, by performing the steps of:
-
linking the star schema to the source database through a communication network; automatically analyzing queries on the source database, for defining, refining or validating the star schema, based on a usage analysis technique to gather any of (i) entities and columns, (ii) entity keys, (iii) relationships between and within entities, (iv) measures, (v) workflow attributes, or (vi) an update frequency associated with the entities and the columns of the source database; automatically analyzing data in the source database, for defining, refining or validating the star schema, based on a data profiling technique to gather any of (i) entity keys, (ii) relationships between and within entities, (iii) measures, (iv) workflow attributes, and (v) specialized entities; automatically determining a first ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database based on any of (a) the usage analysis technique, and (b) the data profiling technique; automatically determining a second ranking for any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database by automatically gathering a ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys that are determined based on any of (a) the usage analysis technique, and (b) the data profiling technique; and compiling ranked inputs of the usage analysis and data profiling to define, refine, or validate the star schema for the source database, wherein the one or more sequences of instructions, which when executed by the one or more processors further causes (a) automatic determination of measures comprising quantifiable metrics by analyzing queries on the source database for expressions on numeric attributes in a projection list, and (b) automatic ranking of expressions in an ordered list of candidate measures for expressions that occur across distinct queries, wherein the automatic gathering of inputs for defining, refining or validating the star schema for the source database increases an analytics efficiency to identify changes in a source warehouse schema containing data gathered from a plurality of databases in order to output a computer-generated dimensional model containing the star schema, and wherein all automatic processes are performed using a computer structured query programming method. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
-
-
18. One or more non-transitory computer readable storage mediums storing one or more sequences of instructions, which when executed by one or more processors, causes automatically gathering inputs for defining, refining or validating star schema for a source database, by performing the steps of:
-
linking the star schema to the source database through a communication network; automatically analyzing at least one of; (a) programmatic APIs or documentation associated with the source database to extract metadata for defining, refining or validating the star schema based on at least one of name patterns, or keywords, wherein the metadata comprises any of (i) entities and columns, (ii) column types and lengths, (iii) entity keys, (iv) relationships between and within the entities, (v) workflow attributes, and (vi) specialized entities corresponding to the source database; (b) queries on the source database based on a usage analysis technique to gather any of (i) entities and columns, (ii) entity keys, (iii) relationships between and within entities, (iv) measures, (v) workflow attributes, or (vi) an update frequency associated with the entities and the columns of the source database for defining, refining or validating the star schema;
or(c) data in the source database based on a data profiling technique to gather any of (i) entity keys, (ii) relationships between and within entities, (iii) measures, (iv) workflow attributes, and (v) specialized entities for defining, refining or validating the star schema; automatically analyzing an underlying model and queries in operational reports based on an operational reporting environment analysis technique to determine any of (i) candidate significant entities out of the entities of the source database, (ii) relationships between and within the entities, (iii) specialized entities, and (iv) relevant measures; and compiling inputs of the extracted metadata, usage analysis, data profiling, and operating reporting environment analysis to define, refine, or validate the star schema for the source database, wherein the one or more sequences of instructions, which when executed by the one or more processors further causes (a) automatic determination of measures comprising quantifiable metrics by analyzing queries on the source database for expressions on numeric attributes in a projection list, and (b) automatic ranking of expressions in an ordered list of candidate measures for expressions that occur across distinct queries, wherein the automatic gathering of inputs for defining, refining or validating the star schema for the source database increases an analytics efficiency to identify changes in a source warehouse schema containing data gathered from a plurality of databases in order to output a computer-generated dimensional model containing the star schema, and wherein all automatic processes are performed using a computer structured query programming method.
-
-
19. A hardware-configured processor implemented automated data warehouse input gathering system for defining, refining or validating star schema for a source database, the automated data warehouse data input gathering system comprising:
-
a communication network to link the star schema to the source database; a source metadata analysis module implemented by the processor that automatically analyzes programmatic APIs or documentation associated with the source database to extract metadata for defining, refining or validating the star schema based on at least one of (a) name patterns, or (b) keywords, wherein the metadata comprises any of (i) entities and columns, (ii) column types and lengths, (iii) entity keys, (iv) relationships between and within the entities, (v) workflow attributes, and (vi) specialized entities corresponding to the source database; a usage analysis module implemented by the processor that automatically analyzes queries on the source database to gather any of (i) entities and columns, (ii) entity keys, (iii) relationships between and within entities, (iv) measures, (v) workflow attributes, or (vi) an update frequency associated with the entities and the columns of the source database for defining, refining or validating the star schema, wherein the processor executes one or more sequences of instructions causing (a) automatic determination of measures comprising quantifiable metrics by analyzing queries on the source database for expressions on numeric attributes in a projection list, and (b) automatic ranking of expressions in an ordered list of candidate measures for expressions that occur across distinct queries; a data profiling module implemented by the processor that automatically analyzes data in the source database to gather any of (i) entity keys, (ii) relationships between and within entities, (iii) measures, (iv) workflow attributes, and (v) specialized entities for defining, refining or validating the star schema; and a rank determining module implemented by the processor that automatically determines a first ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database based on any of (a) the usage analysis technique, and (b) the data profiling technique, wherein the rank determining module automatically determines a second ranking for any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys of the source database by automatically gathering a ranking of any of (i) columns, (ii) entities, (iii) measures, (iv) relationships between and within entities, (v) workflow attributes, (vi) entity keys that are determined based on any of (a) the usage analysis technique, and (b) the data profiling technique, wherein a compiling of the extracted metadata and ranked inputs of the usage analysis and data profiling is used to define, refine, or validate the star schema for the source database, wherein automatic gathering of inputs for defining, refining or validating the star schema for the source database increases an analytics efficiency to identify changes in a source warehouse schema containing data gathered from a plurality of databases in order to output a computer-generated dimensional model containing the star schema, and wherein all automatic processes are performed using a computer structured query programming method. - View Dependent Claims (20)
-
-
21. A computer hardware-implemented method for automatically gathering inputs for defining, refining or validating star schema for a source database, comprising:
-
linking the star schema to the source database through a communication network; determining entities based on at least one of; (i) a usage analysis technique by automatically querying tables in a from clause on the source database, or (ii) a data profiling technique by automatically analyzing the tables and columns in the source database, or (iii) an operational reporting environment analysis technique by automatically analyzing an underlying model and queries in operational reports; determining entity keys based on at least one of; (i) the usage analysis technique by automatically analyzing queries on the source database for at least one of (a) join conditions of the columns with equality, or (b) a repetition of equality join conditions of the columns, or (ii) the data profiling technique by automatically analyzing (a) the source database, (b) columns with name patterns that are associated with a primary key or a foreign key, or (c) columns used for keys like integers or globally unique identifiers (GUID'"'"'s), wherein the entity keys associated with a repetition of equality join conditions across the queries are ranked in a ranked list of candidate entity keys; determining implicit relationships based on at least one of; (i) the usage analysis technique by automatically analyzing tables joined together to indicate relationships or hierarchies between tables in queries on the source database, or (ii) the data profiling technique by (a) automatically comparing data values in similar typed columns across tables while profiling source data, or (b) automatically comparing data values in similar typed columns across tables while profiling source data, starting with key columns, followed by columns with the same name, followed by columns with the same prefix but key name suffixes, followed by columns with key name suffixes, followed by columns with similar names, or (iii) the operational reporting environment analysis technique by automatically analyzing the underlying model and queries in the operational reports; determining workflow and workflow correlated attributes based on at least one of; (i) the usage analysis technique by (a) automatically analyzing queries on the source database that comprise conditions comparing a column to a set of predetermined enumerated values associated with a workflow, (b) automatically analyzing queries that comprises where conditions on workflow columns with group by column list of workflow correlated attributes, or (ii) the data profiling technique by (a) automatically determining columns having predetermined domain specific enumerated values associated with a workflow in the source database, or (b) automatically determining ID columns with values in a small range typically associated with workflow state transitions, or (c) automatically determining workflow correlated attributes that are specific to a workflow stage; determining specialized entities based on at least one of; (i) the data profiling technique by (a) automatically identifying unit of measure from the source database based on any of (i) name matching, and (ii) facts and measures that are constant multiplicative factors of each other indicating different unit of measures, or (b) automatically identifying currency attributes from the source database based on any of (i) name matching, and (ii) facts with multiplicative factors that vary that are specified in dedicated tables indicating different currencies, or (ii) the operational reporting environment analysis technique by automatically analyzing the underlying model and queries in the operational reports; determining commonly used measures comprising quantifiable metrics based on at least one of; (i) the usage analysis technique by automatically analyzing queries on the source database for expressions on numeric attributes in a projection list, or (ii) the operational reporting environment analysis technique by analyzing the underlying model and queries in the operational reports; automatically ranking expressions in an ordered list of candidate measures for expressions that occur across distinct queries; determining, based on the usage analysis technique, an update frequency associated with the entities and the columns in the source database based on a number of inserts, updates, or deletes in the entities and a track of which of the columns are updated; and compiling inputs of the usage analysis, data profiling, and operational reporting environment analysis to define, refine, or validate the star schema for the source database, wherein the automatic gathering of inputs for defining, refining or validating the star schema for the source database increases an analytics efficiency to identify changes in a source warehouse schema containing data gathered from a plurality of databases in order to output a computer-generated dimensional model containing the star schema, and wherein all automatic processes are performed using a computer structured query programming method.
-
Specification