Automatically mining patterns for rule based data standardization systems
First Claim
1. A system for mining for sub-patterns within a text data set, the system comprising:
a data source to store the text data set; and
a processor configured with logic to:
find a set of N frequently occurring sub-patterns within the text data set;
extract the N sub-patterns from the data set; and
cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of the each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein:
the value of each of the characters or the symbols is related to a respective probability value of each of the characters or the symbols in the text data set; and
the respective distance D of the each pair of the extracted sub-patterns is related to a sum of the values of each of the characters or the symbols of the respective first sub-pattern, a sum of the values of each of the characters or the symbols of the respective second sub-pattern, and a sum of the values of each of the characters or the symbols of a longest common substring between the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns.
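The distance described in the claim can be illustrated with a minimal sketch. The claim states only that symbol values are "related to" probabilities and that D is "related to" three sums, so the specific choices here are assumptions made for illustration: each symbol's value is taken as -log(p), and D is taken as W(first) + W(second) - 2 * W(longest common substring). The function names and the pattern symbols (N, A, T) are hypothetical.

```python
import math
from collections import Counter


def symbol_values(patterns):
    """Give each symbol a value derived from its probability in the data.

    Assumption: value = -log(p), so rarer symbols carry more weight.
    """
    counts = Counter(ch for p in patterns for ch in p)
    total = sum(counts.values())
    return {ch: -math.log(n / total) for ch, n in counts.items()}


def longest_common_substring(a, b):
    """Return the longest contiguous run of symbols shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous symbols.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def weight(s, values):
    """Sum of the symbol values of s."""
    return sum(values[ch] for ch in s)


def distance(a, b, values):
    """Assumed form: D = W(a) + W(b) - 2 * W(lcs(a, b))."""
    common = longest_common_substring(a, b)
    return weight(a, values) + weight(b, values) - 2 * weight(common, values)


# Hypothetical sub-patterns: N = number, A = alphabetic word, T = street type.
patterns = ["NAT", "NAAT", "AT"]
values = symbol_values(patterns)
d = distance("NAT", "NAAT", values)
```

Under these assumptions, two identical sub-patterns have distance zero, and the distance grows with the weighted amount of material that falls outside the shared substring.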
Abstract
Computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group.
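The three steps in the abstract (find, extract, cluster) can be sketched as follows. This is an illustration under stated assumptions, not the patented method: sub-patterns are assumed to be contiguous substrings of symbol strings, frequency is a plain occurrence count, and the grouping uses complete-linkage agglomerative clustering, which is a stand-in since the abstract does not name a clustering algorithm. The names frequent_subpatterns and cluster, the length bounds, and the dist callable are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_subpatterns(patterns, n, min_len=2, max_len=4):
    """Return the n most frequent contiguous sub-patterns.

    Assumption: a sub-pattern is any substring of min_len..max_len symbols.
    """
    counts = Counter()
    for p in patterns:
        for size in range(min_len, max_len + 1):
            for start in range(len(p) - size + 1):
                counts[p[start:start + size]] += 1
    return [sub for sub, _ in counts.most_common(n)]


def cluster(items, k, dist):
    """Merge the two closest groups until only k groups remain.

    Closeness of two groups is the largest pairwise distance between
    their members (complete linkage, an assumed choice).
    """
    groups = [[item] for item in items]
    while len(groups) > k:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: max(
                dist(a, b) for a in groups[ij[0]] for b in groups[ij[1]]
            ),
        )
        # i < j, so removing group j leaves index i valid.
        groups[i].extend(groups.pop(j))
    return groups


# Hypothetical input patterns and a placeholder distance (length difference).
top = frequent_subpatterns(["NAT", "NAT", "NAAT"], n=3)
groups = cluster(top, k=2, dist=lambda a, b: abs(len(a) - len(b)))
```

The dist argument is left pluggable so that a distance based on the longest common substring and symbol probabilities, as the claims describe, could be supplied in place of the placeholder.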
14 Claims
7. A computer program product for mining for sub-patterns within a text data set, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
find a set of N frequently occurring sub-patterns within the text data set;
extract the N sub-patterns from the text data set; and
cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein:
the value of each of the characters or the symbols is related to a respective probability value of each of the characters or the symbols in the text data set; and
the respective distance D of the each pair of the extracted sub-patterns is related to a sum of the values of each of the characters or the symbols of the respective first sub-pattern, a sum of the values of each of the characters or the symbols of the respective second sub-pattern, and a sum of the values of each of the characters or the symbols of a longest common substring between the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns.
Specification