AUTOMATICALLY MINING PATTERNS FOR RULE BASED DATA STANDARDIZATION SYSTEMS
Abstract
Computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group.
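The first two steps in the abstract (finding and extracting the N most frequent sub-patterns) can be sketched as follows. This is a minimal illustration, not the patented method: it assumes each record has already been reduced to a pattern string of one-character class symbols, treats a sub-pattern as any contiguous run of symbols, and uses hypothetical names (`top_subpatterns`, `min_len`, `max_len`) that do not appear in the source.

```python
from collections import Counter


def top_subpatterns(patterns, n, min_len=2, max_len=4):
    """Return the n most frequent contiguous sub-patterns.

    Assumption: each pattern is a string whose characters are class
    symbols; min_len and max_len bound the sub-pattern length and are
    illustrative parameters only.
    """
    counts = Counter()
    for pattern in patterns:
        for size in range(min_len, max_len + 1):
            # Slide a window of the current size across the pattern.
            for start in range(len(pattern) - size + 1):
                counts[pattern[start:start + size]] += 1
    return [sub for sub, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = ["^+T", "^+T", "^+R", "A^+"]  # made-up pattern strings
    print(top_subpatterns(sample, 3))
```

Counting every window is quadratic in pattern length, which is acceptable for short pattern strings; a suffix-based index would be a reasonable substitute for longer inputs.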
14 Claims
1. A system for mining for sub-patterns within a text data set, the system comprising:
a data source to store the text data set; and
a processor configured with logic to:
find a set of N frequently occurring sub-patterns within the text data set;
extract the N sub-patterns from the data set; and
cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of the each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein the value of each of the characters or the symbols is related to a respective frequency of occurrence of each of the characters or the symbols in the text data set such that a character or a symbol having a frequency of occurrence that is higher than a frequency of occurrence of a second character or a second symbol has a lower value associated therewith than a value associated with the second character or the second symbol.
Dependent claims: 2, 3, 4, 5, 6
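The distance value D in claim 1 combines a longest common substring with per-symbol values that fall as a symbol's frequency rises. The claim does not give a formula, so the sketch below makes two labeled assumptions: a symbol's value is `1 + log(total / count)`, and D is one minus the weighted share of the two sub-patterns covered by their longest common substring. All function names are hypothetical.

```python
import math
from collections import Counter


def symbol_values(data):
    """Assumed weighting: more frequent symbols get lower values."""
    counts = Counter(ch for record in data for ch in record)
    total = sum(counts.values())
    return {ch: 1.0 + math.log(total / n) for ch, n in counts.items()}


def longest_common_substring(a, b):
    """Classic dynamic-programming longest common (contiguous) substring."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def weight(s, values):
    """Sum of symbol values; every symbol must occur in the data set."""
    return sum(values[ch] for ch in s)


def distance(a, b, values):
    """Assumed form of D: 0.0 for identical strings, 1.0 for disjoint ones."""
    total = weight(a, values) + weight(b, values)
    if total == 0:
        return 0.0  # both sub-patterns are empty
    shared = weight(longest_common_substring(a, b), values)
    return 1.0 - 2.0 * shared / total


if __name__ == "__main__":
    data = ["^+T", "^+R", "^^"]  # made-up pattern strings
    values = symbol_values(data)
    print(distance("^+T", "^+R", values))
```

Under this weighting, a match on a rare symbol lowers D more than a match on a common one, which is the ordering the wherein clause requires; any other strictly decreasing function of frequency would satisfy the same ordering.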
7. A computer program product for mining for sub-patterns within a text data set, the computer program product comprising:
a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
find a set of N frequently occurring sub-patterns within the text data set;
extract the N sub-patterns from the text data set; and
cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein the value of each of the characters or the symbols is related to a respective frequency of occurrence of each of the characters or the symbols in the text data set such that a character or a symbol having a frequency of occurrence that is higher than a frequency of occurrence of a second character or a second symbol has a lower value associated therewith than a value associated with the second character or the second symbol.
Dependent claims: 8, 9, 10, 11, 12, 13, 14
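The clustering step recited in the independent claims places sub-patterns into K groups according to a pairwise distance. The claims do not name a clustering algorithm, so the sketch below substitutes plain single-linkage agglomerative merging as one possible reading. The function `cluster_into_k` and the stand-in `toy_distance` (a Jaccard distance over symbols, used only to keep the example self-contained) are hypothetical.

```python
def cluster_into_k(items, k, dist):
    """Merge the two closest groups until only k groups remain.

    Assumption: single-linkage merging; `dist` is any callable that
    returns a distance for a pair of sub-patterns.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[item] for item in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def toy_distance(a, b):
    """Stand-in distance for illustration only: Jaccard over symbols."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    subpatterns = ["^+T", "^+R", "AB", "AC"]  # made-up sub-patterns
    print(cluster_into_k(subpatterns, 2, toy_distance))
```

In practice the `dist` argument would be a distance built from the longest common substring and frequency-based symbol values, as the claims describe, in place of the stand-in.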
Specification