Method for matching pattern-based data
First Claim
1. A method of matching pattern-based data, comprising:
extracting first distinct values from a first input dataset and second distinct values from a second input dataset;
generating a first pattern based on symbols appearing in the first distinct values and a second pattern based on symbols appearing in the second distinct values, the first and second patterns comprising nodes and one or more delimiters;
calculating support levels for the nodes;
removing one or more delimiters from the first and second patterns using the support levels;
wherein removing one or more delimiters from the first pattern further comprises calculating the support level at a node by summing support values of incoming transitions to that node;
initializing an expansion factor;
expanding a language of the first and second patterns at the expansion factor, the expanding of the language diminishing a size of the first and second patterns and decreasing a precision of the first and second patterns, wherein a number of distinct symbols allowed by the expanded language of the first pattern divided by a number of distinct symbols allowed by the non-expanded language of the first pattern equals the expansion factor and a number of distinct symbols allowed by the expanded language of the second pattern divided by a number of distinct symbols allowed by the non-expanded language of the second pattern equals the expansion factor;
incrementing the expansion factor and repeating the expanding of the language when the expansion factor is less than a predetermined value;
computing a similarity of the first pattern and the second pattern using the expanded language of the first and second patterns; and
matching the first input dataset with the second input dataset based on the similarity computation.
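The support-level step recited above (summing the support values of incoming transitions to a node) can be illustrated with a minimal Python sketch. The transition table, the node names, the `min_support` threshold, and the rule that a delimiter is dropped when its support falls below that threshold are all assumptions made for illustration; the claim states only that delimiters are removed "using the support levels".

```python
from collections import defaultdict


def node_support(transitions):
    """Support of a node = sum of the support values of its incoming transitions."""
    support = defaultdict(int)
    for (_source, target), value in transitions.items():
        support[target] += value
    return dict(support)


def low_support_delimiters(transitions, delimiters, min_support):
    """Return the delimiter nodes whose support is below min_support.

    The threshold rule is an assumption of this sketch, not taken from the claim.
    """
    support = node_support(transitions)
    return {d for d in delimiters if support.get(d, 0) < min_support}


# Hypothetical transition counts: (source node, target node) -> support value.
transitions = {
    ("start", "digits"): 10,
    ("digits", "-"): 9,
    ("digits", "/"): 1,
    ("-", "letters"): 9,
    ("/", "letters"): 1,
}

support = node_support(transitions)
removable = low_support_delimiters(transitions, {"-", "/"}, min_support=2)
print(support)
print(removable)
```

In this toy input the "/" delimiter is reached by a single transition, so it is the only delimiter flagged for removal under the assumed threshold.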
Abstract
A data matching method matches pattern-based data by generating a regular expression pattern for each input dataset and describing similarity measures between the generated patterns. The method analyzes an input dataset in terms of symbol classes, generalizing input values into a general pattern so that overlap between input datasets can be identified or extrapolated; this aids in matching fields in databases that are being merged and in learning a pattern for an input dataset. For each sequence of data values, the method computes a compact pattern describing the sequence. Embodiments comprise noise reduction and repetitive pattern discovery in the input dataset, and calculation of recall and precision of the generated pattern.
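As a rough illustration of generalizing input values into symbol classes, the following Python sketch turns each value into a small regular expression and compares two datasets by the overlap of their generalized forms. The function names, the choice of two symbol classes (decimal digits and ASCII letters), and the use of Jaccard overlap as the similarity measure are assumptions of this sketch; the abstract does not specify them.

```python
import re
from itertools import groupby


def symbol_class(ch):
    """Map a character to a regex fragment for its (assumed) symbol class."""
    if ch.isdecimal():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # anything else is kept as a literal symbol


def generalize(value):
    """Collapse runs of the same symbol class into one counted regex fragment."""
    parts = []
    for cls, run in groupby(value, key=symbol_class):
        count = len(list(run))
        parts.append(cls if count == 1 else f"{cls}{{{count}}}")
    return "".join(parts)


def dataset_patterns(values):
    """Generalized forms of the distinct values in a dataset."""
    return {generalize(v) for v in set(values)}


def jaccard(a, b):
    """Jaccard overlap of two pattern sets (a stand-in similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


phones_a = ["555-1234", "867-5309"]
phones_b = ["123-4567"]
zip_codes = ["12345"]

print(jaccard(dataset_patterns(phones_a), dataset_patterns(phones_b)))
print(jaccard(dataset_patterns(phones_a), dataset_patterns(zip_codes)))
```

The two phone-number columns share no literal values yet generalize to the same pattern, which is the kind of overlap the abstract describes as useful when matching fields across databases being merged.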
13 Claims
1. A method of matching pattern-based data, comprising:
(steps as recited in full under First Claim above)
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method for providing a matching pattern-based data service, comprising:
receiving first distinct values from a first input dataset and second distinct values from a second input dataset;
invoking a hardware configuration utility, wherein the first distinct values and the second distinct values are made available to the hardware configuration utility for derivation of symbols of the first input dataset to generate a first pattern comprising nodes and one or more delimiters and derivation of symbols of the second input dataset to generate a second pattern comprising nodes and one or more delimiters, the hardware configuration utility further removing one or more delimiters from the first and second patterns using support levels calculated for the nodes, expanding a language of the first and second patterns to an expansion factor wherein a subset of symbols appearing in a node is expanded to a symbol class and a size of the first and second patterns is diminished broadening the language, and calculating a first and second precision by dividing a number of first and second distinct input values by a size of the language of the first and second patterns; and
matching symbols and symbol classes of the first input dataset and the second input dataset from the hardware configuration utility based on similarity computation of the expanded language of the first pattern and the second pattern.
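The precision calculation recited in claim 13 (number of distinct input values divided by the size of the pattern's language) can be sketched in Python as follows. This is a simplified model under stated assumptions: all values have equal length, each character position is one node, delimiters and support levels are ignored, and only two symbol classes (digits and uppercase letters) are available. The helper names are hypothetical.

```python
import string
from math import prod

DIGITS = set(string.digits)
UPPER = set(string.ascii_uppercase)


def build_nodes(values):
    """One symbol set per character position (assumes equal-length values)."""
    return [set(column) for column in zip(*values)]


def language_size(nodes):
    """Number of strings the pattern accepts: product of the node set sizes."""
    return prod(len(symbols) for symbols in nodes)


def precision(values, nodes):
    """Distinct input values divided by the size of the pattern's language."""
    return len(set(values)) / language_size(nodes)


def expand(nodes):
    """Widen a node to a whole symbol class when its symbols fit inside one."""
    expanded = []
    for symbols in nodes:
        for symbol_class in (DIGITS, UPPER):
            if symbols <= symbol_class:
                symbols = symbol_class
                break
        expanded.append(set(symbols))
    return expanded


values = ["A1", "A2", "B1"]
nodes = build_nodes(values)   # symbols seen per position: {A, B} and {1, 2}
wide = expand(nodes)          # each position widened to its symbol class

print(precision(values, nodes))
print(precision(values, wide))
```

Before expansion the pattern accepts 4 strings for 3 distinct inputs; after widening both nodes to full classes it accepts 260, so the precision drops even though the pattern itself could now be written more compactly (for example as a class per node).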
Specification