Data quality issue detection through ontological inferencing
First Claim
1. A method for use in detecting data quality issues in one or more instances of an incoming set of data, the method comprising:
determining a scope of an incoming data set to be receivable at a processor, wherein said determining a scope of the incoming data set comprises automatically perusing a plurality of instances of data structures comprising the incoming data set, and wherein each instance includes one or more entries;
obtaining, based on the determined scope, a domain ontology that includes a plurality of TBox statement collections that collectively comprise metadata describing desired or acceptable properties of data corresponding to the determined scope, and wherein the metadata in each TBox statement collection describes a) a plurality of data types, b) at least one format that each data type should have, and c) an indication of whether or not each format is required in order for the data to be considered compliant with the TBox statement collection, wherein at least one of the formats of at least one of the data types is indicated as being required in order for the data to be considered compliant with the TBox statement collection;
mapping, using the processor, the incoming data set to the domain ontology, wherein said mapping comprises linking specific data structures in the incoming data set to particular TBox statement collections of the obtained domain ontology; and
for each instance of the incoming data set:
identifying an anticipated TBox statement collection of the plurality of TBox statement collections of the domain ontology;
ascertaining, using the processor, whether the instance can be inferenced into the anticipated TBox statement collection of the domain ontology, wherein the instance can be inferenced into the anticipated TBox statement collection when the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection, wherein the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection when at least one data structure in the instance is the at least one of the data types having the at least one of the required formats;
determining, by the processor, that the instance is free of at least one data quality issue responsive to the instance being inferenced into the anticipated TBox statement collection; and
determining, by the processor, that the instance has at least one data quality issue responsive to the instance not being inferenced into the anticipated TBox statement collection, wherein when the processor determines that the instance has at least one data quality issue, the method further includes ascertaining whether the instance can be inferenced into any other TBox statement collections of the domain ontology, and wherein the processor:
determines that the at least one data quality issue comprises a structural and/or formatting issue associated with the instance when it is ascertained that the instance cannot be inferenced into any other TBox statement collections of the domain ontology; and
determines that the at least one data quality issue comprises a labeling issue associated with the instance when it is ascertained that the instance can be inferenced into another TBox statement collection of the domain ontology.
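The decision logic recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation. A TBox statement collection is modeled here as a hypothetical `TBoxCollection` that lists data types, a regular-expression format for each, and a required flag; "inferencing" is approximated by checking that every required format is satisfied, which is stricter and far simpler than what an ontology reasoner would do. All names (`FieldRule`, `TBoxCollection`, `can_inference`, `classify_instance`, and the sample `Person` and `Company` collections) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict


@dataclass
class FieldRule:
    pattern: str     # format the data type should have (a regex, by assumption)
    required: bool   # whether that format is required for compliance


@dataclass
class TBoxCollection:
    name: str
    rules: Dict[str, FieldRule]  # data type name -> format rule


def can_inference(instance: Dict[str, str], tbox: TBoxCollection) -> bool:
    """Simplified stand-in for inferencing: all required formats must match."""
    for data_type, rule in tbox.rules.items():
        if not rule.required:
            continue  # optional formats never block compliance in this sketch
        value = instance.get(data_type)
        if value is None or not re.fullmatch(rule.pattern, value):
            return False
    return True


def classify_instance(
    instance: Dict[str, str],
    anticipated: str,
    ontology: Dict[str, TBoxCollection],
) -> str:
    """Return 'ok', 'labeling issue', or 'structural/formatting issue'."""
    if can_inference(instance, ontology[anticipated]):
        return "ok"
    # The instance failed its anticipated collection: try every other one.
    for name, tbox in ontology.items():
        if name != anticipated and can_inference(instance, tbox):
            return "labeling issue"
    return "structural/formatting issue"


# Hypothetical two-collection domain ontology.
ONTOLOGY = {
    "Person": TBoxCollection("Person", {
        "name": FieldRule(r"[A-Za-z ]+", required=True),
        "ssn": FieldRule(r"\d{3}-\d{2}-\d{4}", required=True),
        "email": FieldRule(r"[^@\s]+@[^@\s]+", required=False),
    }),
    "Company": TBoxCollection("Company", {
        "name": FieldRule(r"[A-Za-z ]+", required=True),
        "ein": FieldRule(r"\d{2}-\d{7}", required=True),
    }),
}

good = {"name": "Ada Lovelace", "ssn": "123-45-6789"}
mislabeled = {"name": "Acme Corp", "ein": "12-3456789"}
malformed = {"name": "Ada Lovelace", "ssn": "123456789"}

print(classify_instance(good, "Person", ONTOLOGY))        # ok
print(classify_instance(mislabeled, "Person", ONTOLOGY))  # labeling issue
print(classify_instance(malformed, "Person", ONTOLOGY))   # structural/formatting issue
```

The second branch is the point of interest: an instance that fails its anticipated collection but fits another one is treated as mislabeled, while an instance that fits no collection is treated as structurally or format-wise defective.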
Abstract
Systems and methods (e.g., utilities) for use in providing automated data quality detection that can be used multiple times across various domains and across multiple data quality spheres. A data structure or schema of an incoming data set is initially mapped to a desired data or knowledge state in a domain ontology made up of a number of TBox statements. Data quality issues in the incoming data set can then be detected as instances of the incoming data set are attempted to be inferenced against or otherwise matched to anticipated TBox statements of the domain ontology.
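The initial mapping step mentioned in the abstract, linking each incoming data structure to a TBox statement collection, might be sketched as follows. The source does not say how the link is chosen, so the overlap heuristic below is purely an assumption for illustration, as are the function name `map_schema` and the sample table and collection names.

```python
from typing import Dict, Optional, Set


def map_schema(
    schema: Dict[str, Set[str]],
    ontology: Dict[str, Set[str]],
) -> Dict[str, Optional[str]]:
    """Link each data structure to the collection sharing the most data types.

    Assumed heuristic: count the column names that a structure has in common
    with the data types each collection describes; ties go to the collection
    listed first, and a structure with no overlap is left unmapped (None).
    """
    mapping: Dict[str, Optional[str]] = {}
    for structure, columns in schema.items():
        best_name, best_overlap = None, 0
        for tbox_name, data_types in ontology.items():
            overlap = len(columns & data_types)
            if overlap > best_overlap:
                best_name, best_overlap = tbox_name, overlap
        mapping[structure] = best_name
    return mapping


# Hypothetical incoming schema (structure -> column names) and ontology
# (TBox collection -> data type names).
schema = {
    "customers": {"name", "ssn", "email"},
    "vendors": {"name", "ein"},
    "logs": {"timestamp"},
}
ontology = {
    "Person": {"name", "ssn"},
    "Company": {"name", "ein"},
}

links = map_schema(schema, ontology)
print(links)  # {'customers': 'Person', 'vendors': 'Company', 'logs': None}
```

Once such links exist, the collection linked to a structure serves as the anticipated collection for every instance of that structure.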
13 Claims
7. A system for use in detecting data quality issues in one or more instances of an incoming set of data, the system comprising:
a processor; and
a memory logically connected to the processor and comprising a set of computer readable instructions executable by the processor to:
determine a scope of an incoming data set to be receivable at the processor, wherein the scope of the incoming data set is determined by automatically perusing a plurality of instances comprising the incoming data set, wherein each instance includes one or more entries;
obtain, based on the determined scope, a domain ontology that includes a plurality of TBox statement collections that collectively comprise metadata describing desired or acceptable properties of data elements corresponding to the determined scope, and wherein the metadata in each TBox statement collection describes a) a plurality of data types, b) at least one format that each data type should have, and c) an indication of whether or not each format is required in order for the data to be considered compliant with the TBox statement collection, wherein at least one of the formats of at least one of the data types is indicated as being required in order for the data to be considered compliant with the TBox statement collection;
map the incoming data set to the domain ontology, wherein the incoming data set is mapped to the domain ontology by linking specific data structures in the incoming data set to particular TBox statement collection of the obtained domain ontology; and
for each instance of the incoming data set:
identify an anticipated TBox statement collection of the plurality of TBox statement collections of the domain ontology;
ascertain whether the instance can be inferred into the anticipated TBox statement collection of the domain ontology, wherein the instance can be inferenced into the anticipated TBox statement collection when the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection, wherein the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection when at least one data structure in the instance is the at least one of the data types having the at least one of the required formats;
determine that the instance is free of at least one data quality issue responsive to the instance being inferenced into the anticipated TBox statement collection; and
determine that the instance has at least one data quality issue responsive to the instance not being inferenced into the anticipated TBox statement collection, wherein when it is determined that the instance has at least one data quality issue, the set of computer readable instructions are further executable by the processor to ascertain whether the instance can be inferenced into any other TBox statement collections of the domain ontology, and wherein the set of computer readable instructions are further executable by the processor to:
determine that the at least one data quality issue comprises a structural and/or formatting issue associated with the instance when it is ascertained that the instance cannot be inferenced into any other TBox statement collections of the domain ontology; and
determine that the at least one data quality issue comprises a labeling issue associated with the instance when it is ascertained that the instance can be inferenced into another TBox statement collection of the domain ontology.
13. A non-transitory computer-readable storage medium storing computer-readable instructions, which when executed by a processor, cause the processor to perform a method that comprises:
determining a scope of an incoming data set to be receivable at the processor, wherein said determining a scope of the incoming data set comprises automatically perusing a plurality of instances of data structures comprising the incoming data set, and wherein each instance includes one or more entries;
obtaining, based on the determined scope, a domain ontology that includes a plurality of TBox statement collections that collectively comprise metadata describing desired or acceptable properties of data corresponding to the determined scope, and wherein the metadata in each TBox statement collection describes a) a plurality of data types, b) at least one format that each data type should have, and c) an indication of whether or not each format is required in order for the data to be considered compliant with the TBox statement collection, wherein at least one of the formats of at least one of the data types is indicated as being required in order for the data to be considered compliant with the TBox statement collection;
mapping the incoming data set to the domain ontology, wherein said mapping comprises linking specific data structures in the incoming data set to particular TBox statement collections of the obtained domain ontology; and
for each instance of the incoming data set:
identifying an anticipated TBox statement collection of the plurality of TBox statement collections of the domain ontology;
ascertaining whether the instance can be inferenced into the anticipated TBox statement collection of the domain ontology, wherein the instance can be inferenced into the anticipated TBox statement collection when the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection, wherein the instance comprises an ABox statement that is compliant with the anticipated TBox statement collection when at least one data structure in the instance is the at least one of the data types having the at least one of the required formats;
determining that the instance is free of at least one data quality issue responsive to the instance being inferenced into the anticipated TBox statement collection; and
determining that the instance has at least one data quality issue responsive to the instance not being inferenced into the anticipated TBox statement collection, wherein when the processor determines that the instance has at least one data quality issue, the method further includes ascertaining whether the instance can be inferenced into any other TBox statement collections of the domain ontology, and wherein the set of computer readable instructions when executed by the processor further cause the processor to:
determine that the at least one data quality issue comprises a structural and/or formatting issue associated with the instance when it is ascertained that the instance cannot be inferenced into any other TBox statement collections of the domain ontology; and
determine that the at least one data quality issue comprises a labeling issue associated with the instance when it is ascertained that the instance can be inferenced into another TBox statement collection of the domain ontology.
Specification