Techniques for application data scrubbing, reporting, and analysis
Abstract
Techniques for application data scrubbing, reporting, and analysis are presented. A plurality of data sources are analyzed in accordance with their schemas and matching rules. Merging rules are applied to merge a number of data types across the data sources together. A report is produced for inspection and a master data source is generated. The processing can be iterated with rules modified in response to the report for purposes of refining the master data source.
25 Claims
1. A machine-implemented method for executing on a machine, comprising:
- acquiring, by the machine, a first schema for a first data source and a second schema for a second data source;
- using, by the machine, the first and second schemas to parse both data sources based on syntax and structure defined in the first and second schemas to detect data types and patterns for the data types in both the data sources;
- matching, by the machine, some first patterns associated with the first data source to other second patterns associated with the second data source in response to matching rules, the matching rules provide a link between the patterns detected in the first data source and the second data source, the matching rules obtained from a meta schema that ties the first schema to the second schema and the matching rules are acquired in response to a predefined policy that associates patterns or data types between the two schemas and the matching rules permit a first data type in the first data source to be mapped to a second data type in the second data source even when the first data type is different from the second data type;
- generating, by the machine, a report that identifies the matched first patterns of the first data source to the second patterns of the second source and the report includes metrics for the first data source and the second data source, the metrics including pattern variations for both of the data types, frequency of a particular pattern for a particular one of the data types that occurs within one of the data sources, identifying data source entries where sub data types are missing under a parent data type when required to present in accordance with one of the data source schemas; and
- iterating, by the machine and in response to interaction with a data analyst, the method processing based on modifications supplied by the data analyst for the report and the matching rules based on the metrics to produce a revised report for each iteration and on a last iteration producing a master data source that conforms to enterprise data policies and a final revised report that reports on a state of the first data source and the second data source that comprise the master data source.
Dependent claims: 2, 3, 4, 5, 6, 7
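The parsing, matching, and reporting steps of claim 1 can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the dictionary-based schema layout, the `required_children` key, the field names, and the functions `pattern_of`, `profile`, and `build_report` are all assumptions made for this example. A "pattern" is taken here to be the shape of a value, with digits reduced to `9` and letters to `A`.

```python
from collections import Counter

def pattern_of(value) -> str:
    """Reduce a value to its shape: digits become 9, letters become A."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in str(value)
    )

def profile(records, schema):
    """Count pattern frequencies per field and list missing required sub-fields."""
    freq = {field: Counter() for field in schema["fields"]}
    missing = []
    for index, record in enumerate(records):
        for field in schema["fields"]:
            if field in record:
                freq[field][pattern_of(record[field])] += 1
        # Hypothetical schema key: parent field -> sub-fields that must be present.
        for parent, children in schema.get("required_children", {}).items():
            present = record.get(parent, {})
            for child in children:
                if child not in present:
                    missing.append((index, parent, child))
    return freq, missing

def build_report(source_a, schema_a, source_b, schema_b, matching_rules):
    """Pair the patterns of fields linked by the matching rules and attach metrics."""
    freq_a, missing_a = profile(source_a, schema_a)
    freq_b, missing_b = profile(source_b, schema_b)
    matches = [
        {
            "first": field_a,
            "second": field_b,
            "first_patterns": dict(freq_a[field_a]),
            "second_patterns": dict(freq_b[field_b]),
        }
        for field_a, field_b in matching_rules
    ]
    return {"matches": matches, "missing": {"first": missing_a, "second": missing_b}}

# Illustrative data: a string phone field matched to an integer tel field.
schema_a = {"fields": ["phone"], "required_children": {"address": ["city", "zip"]}}
schema_b = {"fields": ["tel"]}
source_a = [
    {"phone": "555-1234", "address": {"city": "Provo", "zip": "84601"}},
    {"phone": "(801) 555-1234", "address": {"city": "Orem"}},
]
source_b = [{"tel": 5551234}, {"tel": 8015551234}]
rules = [("phone", "tel")]  # stands in for rules obtained from a meta schema

report = build_report(source_a, schema_a, source_b, schema_b, rules)
```

In this sketch the rule `("phone", "tel")` links a string field to an integer field, which mirrors the claim language about mapping data types that differ. An analyst-driven iteration would amount to editing `rules` and calling `build_report` again.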
8. A machine-implemented method for executing on a machine, comprising:
- interacting, by the machine, with a data analyst via an interface presented to the data analyst;
- receiving, by the machine, identifiers for data schemas and data sources associated with those data schemas from the data analyst via the interface;
- acquiring, by the machine, merge rules from the data analyst via the interface, wherein the merge rules identify conditions within the data sources for merging different data types defined in the data schemas together with one another, the analyst identifies a master schema that ties the data sources together via the merge rules included in the master schema, and matching rules are acquired for the master schema and in response to a predefined policy that associates patterns or data types between the two schemas and the matching rules permit a first data type in a first data source to be mapped to a second data type in a second data source even when the first data type is different from the second data type;
- parsing, by the machine, the data sources using the data schemas and enforcing the merge rules to produce a merge report and to produce a master data source that combines the data sources together in accordance with the merge rules, and the merge report having metrics, the metrics including pattern variations for the first and second data types, frequency of a particular pattern for a particular one of the data types that occurs within one of the data sources, identifying data source entries where sub data types are missing under a parent data type when required to present in accordance with one of the two schemas; and
- dynamically jumping, by the machine, from the merge report to an area in one of the data sources based on interaction from a user with the merge report while the user views the metrics.
Dependent claims: 9, 10, 11, 12, 13, 14
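One way to read the merge and "jump" steps of claim 8 is sketched below in Python. The rule layout (`target`, `fields`, `when`), the source names, and the functions `merge_sources` and `jump_to` are hypothetical and chosen only for illustration. The idea shown is that each merge report entry records where it came from, so an interface could take the user from the entry straight to the offending record.

```python
def merge_sources(sources, merge_rules):
    """Build a master list and a merge report from named record lists.

    sources: {source_name: [record, ...]}
    merge_rules: list of dicts with a master 'target' field, a per-source
    'fields' mapping, and a 'when' condition that a value must satisfy.
    """
    master, report = [], []
    for name, records in sources.items():
        for index, record in enumerate(records):
            row = {}
            for rule in merge_rules:
                field = rule["fields"].get(name)
                if field is None:
                    continue  # this rule does not apply to this source
                value = record.get(field)
                if value is not None and rule["when"](value):
                    # Values are stored as strings so differing types can merge.
                    row[rule["target"]] = str(value)
                else:
                    # Keep enough location data to jump back to the record.
                    report.append(
                        {"source": name, "index": index, "field": field, "value": value}
                    )
            if row:
                master.append(row)
    return master, report

def jump_to(sources, entry):
    """Return the source record that a merge report entry points at."""
    return sources[entry["source"]][entry["index"]]

# Illustrative data: two sources that name and type the phone number differently.
sources = {
    "crm": [{"phone": "555-1234"}, {"phone": ""}],
    "billing": [{"tel": 5559876}],
}
merge_rules = [
    {
        "target": "phone",
        "fields": {"crm": "phone", "billing": "tel"},
        "when": lambda v: str(v).strip() != "",
    }
]

master, merge_report = merge_sources(sources, merge_rules)
flagged_record = jump_to(sources, merge_report[0])
```

Here `jump_to` simply returns the record; a real interface would presumably open the data source at that location instead.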
15. A machine-implemented system, comprising:
- a machine configured with a data analysis tool implemented in a non-transitory computer-readable medium and to execute on the machine; and
- the machine or a different machine of a network configured with a data analyzer implemented in a non-transitory computer-readable medium and to execute on the machine or the different machine;
- the data analysis tool is adapted to provide an interface to a data analyst that permits the data analyst to identify data sources for analysis, and the data analyzer is to acquire a separate data schema for each of the data sources and uses the data schemas to parse the data sources to identify data types and patterns, and the data analyzer uses merge rules and policies to merge some of the data types and their corresponding data from the data sources together in a master data source, the merge rules identified in a master schema that ties the data sources together, and matching rules are acquired for the master schema and in response to a predefined policy that associates patterns or data types between the two schemas, and the data analyzer produces a black list report having metrics that identifies areas in the data sources and the black list report is used as input to an automated script that serially accesses the data sources and makes corrections at the areas identified in the black list report and the matching rules permit a first data type in a first data source to be mapped to a second data type in a second data source even when the first data type is different from the second data type, and wherein the metrics including pattern variations for the first and second data types, frequency of a particular pattern for a particular one of the first and second data types that occurs within one of the data sources, identifying data source entries where sub data types are missing under a parent data type when required to present in accordance with one of the two schemas.
Dependent claims: 16, 17, 18, 19, 20
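The black list report in claim 15 serves as input to an automated correction script. A minimal Python sketch of that hand-off follows, assuming a report made of entries with `source`, `index`, `field`, and `issue` keys and a table of correction functions keyed by issue name. These names, and the function `apply_black_list`, are invented for the example and do not come from the patent.

```python
def apply_black_list(sources, black_list, corrections):
    """Walk the black list in order and correct each flagged area in place.

    Returns the number of entries that were corrected. Entries whose issue
    has no registered correction are left untouched.
    """
    fixed = 0
    for entry in black_list:  # processed serially, one entry at a time
        fix = corrections.get(entry["issue"])
        if fix is None:
            continue
        record = sources[entry["source"]][entry["index"]]
        record[entry["field"]] = fix(record.get(entry["field"]))
        fixed += 1
    return fixed

# Illustrative data and corrections.
sources = {"crm": [{"phone": " 555-1234 "}, {"phone": "5551234"}, {"phone": "n/a"}]}
black_list = [
    {"source": "crm", "index": 0, "field": "phone", "issue": "whitespace"},
    {"source": "crm", "index": 1, "field": "phone", "issue": "missing_dash"},
    {"source": "crm", "index": 2, "field": "phone", "issue": "unknown_value"},
]
corrections = {
    "whitespace": str.strip,
    "missing_dash": lambda v: v[:3] + "-" + v[3:],
}

fixed_count = apply_black_list(sources, black_list, corrections)
```

The third entry has no registered correction and is skipped, which leaves it available for manual review by the analyst.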
21. A machine-implemented system, comprising:
- multiple machines of a network configured with applications implemented in a non-transitory computer-readable medium and to process on the multiple machines; and
- a particular machine of the network configured with a data analyzer implemented in a non-transitory computer-readable medium and to process on the particular machine;
- each application produces application data defined by its own schema and the data analyzer parses the application data using the schemas and further uses merge rules and policies to map the application data to a master data source, the merge rules acquired from a master schema that ties the schemas together via the merge rules, and matching rules are acquired for the master schema and in response to a predefined policy that associates patterns or data types between the two schemas, and the matching rules permit a first data type in a first data source to be mapped to a second data type in the second data source even when a first data type is different from the second data type, the data analyzer also generates a merge report having metrics that a user can use to dynamically jump from an area identified in the merge report, while viewing the metrics, to a particular location in application data to dynamically make corrections in the data source, wherein the metrics including pattern variations for both of the first and second data types, frequency of a particular pattern for a particular one of the first and second data types that occurs within one of the data sources, identifying data source entries where sub data types are missing under a parent data type when required to present in accordance with one of the schemas.
Dependent claims: 22, 23, 24, 25
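Claim 21 has several applications, each with its own schema, mapped into one master data source through a master schema. The Python sketch below shows one possible shape for such a master schema, where each master field names its type and the per-application field it is drawn from. The application names, field names, `MASTER_SCHEMA` layout, and `to_master` function are assumptions for illustration only.

```python
# Hypothetical master schema: master field -> target type and per-application source field.
MASTER_SCHEMA = {
    "customer_id": {
        "type": str,
        "from": {"orders_app": "cust_no", "support_app": "customerId"},
    },
    "signup_year": {
        "type": int,
        "from": {"orders_app": "year", "support_app": "since"},
    },
}

def to_master(app_name, record, master_schema=MASTER_SCHEMA):
    """Map one application record to a master record, converting types as needed."""
    row = {}
    for target, spec in master_schema.items():
        source_field = spec["from"].get(app_name)
        if source_field in record:
            # The source type may differ from the master type, so convert it.
            row[target] = spec["type"](record[source_field])
    return row

# The two applications disagree on both field names and data types.
order_row = to_master("orders_app", {"cust_no": 1042, "year": "2009"})
support_row = to_master("support_app", {"customerId": "C-7", "since": 2011})
master_data_source = [order_row, support_row]
```

Type conversion through `spec["type"]` is the simplest choice for a sketch; values that cannot be converted would raise an error here, whereas the claimed system would presumably surface them in the merge report.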
Specification