Automatically mining patterns for rule based data standardization systems
Abstract
Computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group.
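The first two steps named in the abstract, finding and extracting N frequently occurring sub-patterns, can be illustrated with a minimal sketch. It assumes that a sub-pattern is a contiguous run of whitespace-separated tokens and that "frequent" means highest raw count. The function name `frequent_subpatterns`, its parameters, and the sample address records are hypothetical and do not come from the patent.

```python
from collections import Counter


def frequent_subpatterns(records, n_top, min_len=1, max_len=3):
    """Return the n_top most frequent contiguous token sub-sequences.

    Assumption: a sub-pattern is a run of min_len..max_len adjacent tokens.
    """
    counts = Counter()
    for record in records:
        tokens = record.split()
        for size in range(min_len, max_len + 1):
            # Slide a window of `size` tokens across the record.
            for start in range(len(tokens) - size + 1):
                counts[" ".join(tokens[start:start + size])] += 1
    return [pattern for pattern, _ in counts.most_common(n_top)]


# Hypothetical sample data set.
records = [
    "12 MAIN ST APT 4",
    "98 OAK ST APT 7",
    "5 ELM AVE",
]
top = frequent_subpatterns(records, n_top=3, min_len=2, max_len=2)
```

Here `top[0]` is `"ST APT"`, the only two-token sub-pattern that occurs in more than one record. A real system would likely normalize tokens into character or symbol classes before counting.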
2 Claims
1. A system for mining sub-patterns within a text data set, the system comprising:
- a data source to store the text data set; and
- a processor configured with logic to:
  - find a set of N frequently occurring sub-patterns within the data set;
  - extract the N sub-patterns from the data set; and
  - cluster the extracted sub-patterns into K groups such that each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity based upon a longest common substring between the sub-pattern and every other sub-pattern within the same group and also based upon values associated with characters or symbols for the sub-pattern and every other sub-pattern within the same group;
- wherein the processor is configured to determine the distance value D between any two sub-patterns s1 and s2 of the N sub-patterns based upon the following equation:

2. A computer program product for mining for sub-patterns within a text data set, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
  - find a set of N frequently occurring sub-patterns within the data set;
  - extract the N sub-patterns from the data set; and
  - cluster the extracted sub-patterns into K groups such that each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group;
- wherein the computer readable program code is configured to calculate the distance value D between any two sub-patterns s1 and s2 of the N sub-patterns based upon the following equation:
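The claims describe clustering into K groups using a distance value D based on a longest common substring, but the equation itself is not reproduced in this text. The sketch below therefore uses a stand-in distance, one minus the longest-common-substring length divided by the longer string's length, and leaves out the claimed term for values associated with characters or symbols. The complete-linkage merging rule is likewise an assumption, chosen because it compares a sub-pattern with every other member of a group. All function names are hypothetical.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def distance(s1, s2):
    """Stand-in distance in [0, 1]; NOT the equation recited in the claims."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0
    return 1.0 - longest_common_substring_len(s1, s2) / longest


def cluster(patterns, k):
    """Merge the closest groups (complete linkage) until k groups remain."""
    k = max(k, 1)
    groups = [[p] for p in patterns]
    while len(groups) > k:
        best_pair, best_d = None, float("inf")
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Complete linkage: a pair of groups is only as close as
                # its two most dissimilar members.
                d = max(distance(a, b) for a in groups[i] for b in groups[j])
                if d < best_d:
                    best_pair, best_d = (i, j), d
        i, j = best_pair
        groups[i].extend(groups.pop(j))
    return groups


# Hypothetical sub-patterns, e.g. the output of a frequency-mining step.
groups = cluster(["MAIN ST", "OAK ST", "APT 4", "APT 7"], k=2)
```

With this stand-in distance the street-type patterns end up in one group and the apartment patterns in the other. The pairwise search is quadratic in the number of groups at every merge, which is acceptable for a small N but would need a more efficient scheme for large pattern sets.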
Specification