Method for matching pattern-based data

US 7,487,150 B2
Filed: 07/02/2005
Issued: 02/03/2009
Est. Priority Date: 07/02/2005
Status: Expired due to Fees

1 Assignment

0 Petitions

Abstract

A pattern-based data matching method matches pattern-based data. The data matching method generates a regular expression pattern for input datasets and describes similarity measures between the generated patterns. The data matching method analyzes an input dataset in terms of symbol classes, generalizing input values into a general pattern to allow identification or extrapolation of overlap between input datasets, aiding in matching fields in databases that are being merged and in learning a pattern for an input dataset. For each sequence of data values, the present method computes a compact pattern describing the sequence. Embodiments of the data matching method comprise noise reduction and repetitive pattern discovery in the input dataset and calculation of recall and precision of the generated pattern.
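The generalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: it assumes fixed-length values, uses only two hypothetical symbol classes (ASCII digits and ASCII letters), and the function names `generalize_position` and `learn_pattern` are invented for illustration.

```python
import re


def generalize_position(symbols):
    """Map the set of symbols seen at one character position to a regex fragment.

    Assumption: only two symbol classes (ASCII digits, ASCII letters) are used.
    """
    if all(s.isascii() and s.isdigit() for s in symbols):
        return r"\d"
    if all(s.isascii() and s.isalpha() for s in symbols):
        return "[A-Za-z]"
    if len(symbols) == 1:
        # A single non-alphanumeric symbol is kept as a literal (e.g. a delimiter).
        return re.escape(next(iter(symbols)))
    # Mixed symbols: fall back to an explicit character class.
    return "[" + "".join(re.escape(s) for s in sorted(symbols)) + "]"


def learn_pattern(values):
    """Build one regular expression from the distinct values of a dataset."""
    distinct = sorted(set(values))
    lengths = {len(v) for v in distinct}
    if len(lengths) != 1:
        raise ValueError("this sketch only handles fixed-length values")
    width = lengths.pop()
    return "".join(
        generalize_position({v[i] for v in distinct}) for i in range(width)
    )


phones = ["555-1234", "555-9876", "555-1234", "408-0001"]
pattern = learn_pattern(phones)

# The learned pattern accepts values that were never in the input.
print(bool(re.fullmatch(pattern, "650-4321")))
```

Because each position is generalized to a class rather than to the exact symbols observed, two datasets holding different phone numbers would yield the same pattern, which is what makes overlap between fields detectable.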

30 Citations

13 Claims

1. A method of matching pattern-based data, comprising:
- extracting first distinct values from a first input dataset and second distinct values from a second input dataset;
  
  generating a first pattern based on symbols appearing in the first distinct values and a second pattern based on symbols appearing in the second distinct values, the first and second patterns comprising nodes and one or more delimiters;
  
  calculating support levels for the nodes;
  
  removing one or more delimiters from the first and second patterns using the support levels;
  
  wherein removing one or more delimiters from the first pattern further comprises calculating the support level at a node by summing support values of incoming transitions to that node;
  
  initializing an expansion factor;
  
  expanding a language of the first and second patterns at the expansion factor, the expanding of the language diminishing a size of the first and second patterns and decreasing a precision of the first and second patterns, wherein a number of distinct symbols allowed by the expanded language of the first pattern divided by a number of distinct symbols allowed by the non-expanded language of the first pattern equals the expansion factor and a number of distinct symbols allowed by the expanded language of the second pattern divided by a number of distinct symbols allowed by the non-expanded language of the second pattern equals the expansion factor;
  
  incrementing the expansion factor and repeating the expanding of the language when the expansion factor is less than a predetermined value;
  
  computing a similarity of the first pattern and the second pattern using the expanded language of the first and second patterns; and
  
  matching the first input dataset with the second input dataset based on the similarity computation.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. The method of claim 1, wherein generating the first pattern is further based on a symbol class.
  - 3. The method of claim 2, wherein each symbol appears in a given character position in the first distinct values.
  - 4. The method of claim 2, further comprising partitioning symbol ranges into subsets;
    - and wherein each subset is generalized separately to a symbol class.
  - 5. The method of claim 1, further comprising removing noise from the first pattern during the expanding.
  - 6. The method of claim 5, further comprising computing the relative frequency of symbol occurrences appearing in a given character position and identifying as noise a symbol occurrence whose frequency is below a predetermined threshold value.
  - 7. The method of claim 1, further comprising replacing a repeating symbol sequence in the first pattern with a repetitive pattern.
  - 8. The method of claim 7, wherein the repetitive pattern comprises a sequence of symbols having a length ranging between a predetermined minimum length and a predetermined maximum length.
  - 9. The method of claim 1, further comprising calculating a recall value for the first pattern by measuring the amount of noise removed from the first input dataset.
  - 10. The method of claim 1, further comprising calculating a precision value for the first pattern by measuring the total expansion of the first pattern.
  - 11. The method of claim 1, wherein computing the similarity of the expanded language of the first and second patterns comprises measuring the rate of convergence of the first pattern and the second pattern to a universal pattern.
  - 12. The method of claim 1, wherein matching comprises comparing the similarity of the first pattern and the second pattern to a threshold;
    - and determining that the first input dataset matches the second input dataset if the similarity exceeds a predetermined threshold.
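
Two of the steps recited above lend themselves to a short Python sketch: the support level of a node computed by summing the support values of its incoming transitions (claim 1), and the identification of noise by relative frequency at a character position (claim 6). The data layout is an assumption made for illustration: transitions are stored in a dictionary keyed by hypothetical `(source, target)` node names, and frequencies are counted over all input values rather than distinct ones. The function names are likewise invented.

```python
from collections import Counter, defaultdict


def node_support(transitions):
    """Support level of each node = sum of support values of incoming transitions.

    transitions: dict mapping (source_node, target_node) -> support value.
    """
    support = defaultdict(int)
    for (_source, target), value in transitions.items():
        support[target] += value
    return dict(support)


def noisy_symbols(values, position, threshold):
    """Symbols at `position` whose relative frequency is below `threshold`."""
    counts = Counter(v[position] for v in values if len(v) > position)
    total = sum(counts.values())
    return {sym for sym, n in counts.items() if n / total < threshold}


# Hypothetical transition graph: two paths converge on the "-" delimiter node.
transitions = {
    ("start", "digit"): 9,
    ("start", "X"): 1,
    ("digit", "-"): 9,
    ("X", "-"): 1,
}
support = node_support(transitions)
print(support["-"])

# Nine well-formed values and one outlier with "X" in the second position.
values = ["12-3"] * 9 + ["1X-3"]
print(noisy_symbols(values, position=1, threshold=0.2))
```

In this toy input the delimiter node receives support from both incoming paths, while the symbol "X" occurs in only one of ten values and falls below the 0.2 threshold.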

13. A method for providing a matching pattern-based data service, comprising:
- receiving first distinct values from a first input dataset and second distinct values from a second input dataset;
  
  invoking a hardware configuration utility, wherein the first distinct values and the second distinct values are made available to the hardware configuration utility for derivation of symbols of the first input dataset to generate a first pattern comprising nodes and one or more delimiters and derivation of symbols of the second input dataset to generate a second pattern comprising nodes and one or more delimiters, the hardware configuration utility further removing one or more delimiters from the first and second patterns using support levels calculated for the nodes, expanding a language of the first and second patterns to an expansion factor wherein a subset of symbols appearing in a node is expanded to a symbol class and a size of the first and second patterns is diminished broadening the language, and calculating a first and second precision by dividing a number of first and second distinct input values by a size of the language of the first and second patterns; and
  
  matching symbols and symbol classes of the first input dataset and the second input dataset from the hardware configuration utility based on similarity computation of the expanded language of the first pattern and the second pattern.
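
The precision calculation recited in claim 13, the number of distinct input values divided by the size of the pattern's language, can be sketched as follows. The representation is an assumption: a pattern is modeled as a list of sets, one set of allowed symbols per character position, so the language size is the product of the set sizes. The helper names `language_size`, `precision`, and `expand_to_class` are hypothetical.

```python
import string
from math import prod


def language_size(pattern):
    """Number of strings the pattern accepts (product of per-position choices)."""
    return prod(len(allowed) for allowed in pattern)


def precision(distinct_values, pattern):
    """Distinct input values divided by the size of the pattern's language."""
    return len(set(distinct_values)) / language_size(pattern)


def expand_to_class(pattern, position, symbol_class):
    """Return a copy of the pattern with one position widened to a symbol class."""
    expanded = list(pattern)
    expanded[position] = pattern[position] | set(symbol_class)
    return expanded


values = ["A1", "A2", "B1"]
pattern = [{"A", "B"}, {"1", "2"}]  # accepts A1, A2, B1, B2
print(precision(values, pattern))

# Widening the second position to all ten digits broadens the language
# and lowers the precision.
wider = expand_to_class(pattern, 1, string.digits)
print(precision(values, wider))
```

The second figure is lower than the first because the expanded pattern accepts 20 strings instead of 4 while the input still holds only 3 distinct values, which matches the claim's statement that expansion decreases precision.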

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Myllymaki, Jussi Petri; Brown, Paul Geoffrey
Primary Examiner(s): LeRoux; Etienne P
Assistant Examiner(s): Rostami; Mohammad S

Application Number: US11/174,396
Publication Number: US 20070005596A1
Time in Patent Office: 1,312 Days
Field of Search: 707/6
US Class Current: 1/1

CPC Class Codes:
G06F 16/20   of structured data, e.g. re...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
