Automatically mining patterns for rule based data standardization systems

US 10,095,780 B2
Filed: 02/07/2017
Issued: 10/09/2018
Est. Priority Date: 03/07/2012
Status: Active Grant

Abstract

Computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group.
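
The first two steps, finding and extracting the N frequently occurring sub-patterns, can be pictured as a sliding-window count. The sketch below is only an illustration under assumptions the abstract does not state: each record is treated as a string of pattern symbols, a sub-pattern is a contiguous run of at most `max_len` symbols, and the names `top_subpatterns`, `records`, `n`, and `max_len` are hypothetical.

```python
from collections import Counter

def top_subpatterns(records, n, max_len):
    """Return the n most frequent contiguous sub-patterns of length 1..max_len.

    Assumption: every record is a string whose characters are pattern symbols.
    """
    counts = Counter()
    for record in records:
        for start in range(len(record)):
            for length in range(1, max_len + 1):
                if start + length > len(record):
                    break  # window would run past the end of the record
                counts[record[start:start + length]] += 1
    # most_common(n) yields (sub_pattern, frequency) pairs, highest count first
    return counts.most_common(n)

# Hypothetical records made of single-character pattern symbols
records = ["NAS", "NAS", "NA"]
frequent = top_subpatterns(records, n=5, max_len=2)
```

Keeping the frequency alongside each sub-pattern is a design choice made here so that later steps (ranking groups, picking a representative) can reuse the counts.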

39 Citations

14 Claims

1. A system for mining for sub-patterns within a text data set, the system comprising:
- a data source to store the text data set; and
  
  a processor configured with logic to:
  
  find a set of N frequently occurring sub-patterns within the text data set;
  
  extract the N sub-patterns from the data set; and
  
  cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of the each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein:
  
  the value of each of the characters or the symbols is related to a respective probability value of each of the characters or the symbols in the text data set; and
  
  the respective distance D of the each pair of the extracted sub-patterns is related to a sum of the values of each of the characters or the symbols of the respective first sub-pattern, a sum of the values of each of the characters or the symbols of the respective second sub-pattern, and a sum of the values of each of the characters or the symbols of a longest common substring between the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns.
  - 2. The system of claim 1, wherein the processor is configured to determine the respective distance value D between any two sub-patterns s₁ and s₂ of the N sub-patterns based upon the following equation:
  - 3. The system of claim 1, wherein the processor is further configured with logic to:
    - assign a number ranking to each of the K groups based upon a frequency of occurrence of a sub-pattern within the text data set for each group.
  - 4. The system of claim 3, wherein the processor is configured to designate a first group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the first group and a second group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the second group, the highest frequency associated with the first group being greater than the highest frequency associated with the second group, and the first group being ranked with a lower number than a ranking number of the second group.
  - 5. The system of claim 1, wherein the processor is further configured with logic to select a representative sub-pattern from each of the K groups.
  - 6. The system of claim 1, wherein:
    - the processor being configured with the logic to cluster the pairs of the extracted sub-patterns into the K subgroups further comprises the processor being configured with the logic to cluster the pairs of the extracted sub-patterns into the K groups such that the each pair of the extracted sub-patterns is placed in the same group with the other pairs of the extracted sub-patterns based upon a comparison of the respective distance D to a threshold value, and
    - the processor is further configured with the logic to assign a number ranking to each of the K groups based upon a frequency of occurrence of a sub-pattern within the text data set for each group, wherein the processor is configured to designate a first group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the first group and a second group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the second group, the highest frequency associated with the first group being greater than the highest frequency associated with the second group, and the first group being ranked with a lower number than a ranking number of the second group.
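
The distance and threshold clustering recited in claims 1 and 6 can be sketched as follows. The equation referenced in claim 2 is not reproduced in this record, so the formula here is an assumption chosen only to be consistent with the claim language: each symbol's value is taken as -log p(symbol), and D = 1 - 2·W(LCS) / (W(s₁) + W(s₂)), where W sums symbol values. The greedy grouping strategy and every function name are likewise hypothetical.

```python
import math
from collections import Counter

def symbol_values(texts):
    """Assumed weighting: value(c) = -log p(c), so rarer symbols weigh more."""
    counts = Counter(ch for text in texts for ch in text)
    total = sum(counts.values())
    return {ch: -math.log(n / total) for ch, n in counts.items()}

def longest_common_substring(a, b):
    """Dynamic-programming longest common contiguous substring of a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def weight(s, values):
    """Sum of the symbol values of s."""
    return sum(values[ch] for ch in s)

def distance(s1, s2, values):
    """Assumed form: D = 1 - 2*W(lcs) / (W(s1) + W(s2)); 0 means identical."""
    total = weight(s1, values) + weight(s2, values)
    if total == 0:
        return 0.0  # degenerate case: every symbol has zero value
    lcs = longest_common_substring(s1, s2)
    return 1.0 - 2.0 * weight(lcs, values) / total

def cluster(subpatterns, values, threshold):
    """Greedy grouping: join the first group with a member within threshold."""
    groups = []
    for sp in subpatterns:
        for group in groups:
            if any(distance(sp, m, values) <= threshold for m in group):
                group.append(sp)
                break
        else:
            groups.append([sp])  # no close group found, start a new one
    return groups

# Hypothetical sub-patterns; the threshold 0.5 is an arbitrary illustration
patterns = ["ABCD", "ABCE", "XYZ"]
values = symbol_values(patterns)
groups = cluster(patterns, values, threshold=0.5)
```

With this assumed form, two sub-patterns that share a long run of rare symbols end up closer than two that share only common ones, which is one plausible reading of relating symbol values to probabilities.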

7. A computer program product for mining for sub-patterns within a text data set, the computer program product comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
  
  find a set of N frequently occurring sub-patterns within the text data set;
  
  extract the N sub-patterns from the text data set; and
  
  cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein:
  
  the value of each of the characters or the symbols is related to a respective probability value of each of the characters or the symbols in the text data set; and
  
  the respective distance D of the each pair of the extracted sub-patterns is related to a sum of the values of each of the characters or the symbols of the respective first sub-pattern, a sum of the values of each of the characters or the symbols of the respective second sub-pattern, and a sum of the values of each of the characters or the symbols of a longest common substring between the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns.
  - 8. The computer program product of claim 7, wherein the N frequently occurring sub-patterns are the N most frequently occurring sub-patterns within the text data set.
  - 9. The computer program product of claim 7, wherein each sub-pattern has a selected length of consecutive characters or symbols within the text data set that is no greater than a selected number.
  - 10. The computer program product of claim 7, wherein the computer readable program code is further configured to:
    - assign a number ranking to each of the K groups based upon a frequency of occurrence of a sub-pattern within the text data set for each group.
  - 11. The computer program product of claim 10, wherein a first group includes a sub-pattern having a highest frequency in relation to all other sub-patterns in the first group and a second group includes a sub-pattern having a highest frequency in relation to all other sub-patterns in the second group, the highest frequency associated with the first group is greater than the highest frequency associated with the second group, and the first group is ranked with a lower number than a ranking number of the second group.
  - 12. The computer program product of claim 7, wherein the computer readable program code is further configured to:
    - select a representative sub-pattern from each of the K groups.
  - 13. The computer program product of claim 12, wherein the sub-pattern in each group of the K groups having a highest determined frequency of occurrence in relation to determined frequencies of occurrence for all other sub-patterns within each group is selected as the representative sub-pattern for each group.
  - 14. The computer program product of claim 7, wherein:
    - the computer readable program code being configured to cluster the pairs of the extracted sub-patterns into the K groups further comprises the computer readable program code being configured to cluster the pairs of the extracted sub-patterns into the K groups such that the each pair of the extracted sub-patterns is placed in the same group with the other pairs of the extracted sub-patterns based upon a comparison of the respective distance D to a threshold value, and
    - the computer readable program code is further configured to assign a number ranking to each of the K groups based upon a frequency of occurrence of a sub-pattern within the data set for each group, wherein the processor is configured to designate a first group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the first group and a second group including a sub-pattern having a highest frequency in relation to all other sub-patterns in the second group, the highest frequency associated with the first group being greater than the highest frequency associated with the second group, and the first group being ranked with a lower number than a ranking number of the second group.
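
Claims 10 through 13 describe ranking the K groups and selecting a representative sub-pattern from each. A minimal sketch of that behavior, assuming groups are plain lists of sub-patterns and `freq` maps each sub-pattern to its occurrence count (both hypothetical data shapes, as is the name `rank_groups`):

```python
def rank_groups(groups, freq):
    """Return (rank, representative, group) tuples.

    The representative of a group is its most frequent sub-pattern, and
    rank 1 goes to the group whose representative is most frequent overall.
    """
    scored = []
    for group in groups:
        representative = max(group, key=lambda sp: freq[sp])
        scored.append((freq[representative], representative, group))
    # Sort by the representative's frequency only, highest first
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(rank, rep, group)
            for rank, (_, rep, group) in enumerate(scored, start=1)]

# Hypothetical groups and frequencies
groups = [["a", "b"], ["c"]]
freq = {"a": 2, "b": 5, "c": 9}
ranked = rank_groups(groups, freq)
```

Groups whose representatives tie in frequency keep their input order here; the claims do not say how ties are resolved.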

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Chaturvedi, Snigdha; Faruquie, Tanveer A.; Karanam, Hima P.; Mendelssohn, Marvin; Mohania, Mukesh K.; Subramaniam, L. Venkata
Primary Examiner(s): Arjomandi, Noosha
Application Number: US15/426,438
Publication Number: US 20170147688A1
Time in Patent Office: 609 Days
CPC Class Codes

G06F 16/334   Query execution G06F16/335 ...

G06F 16/35   Clustering; Classification

G06F 2216/03   Data mining

G06F 40/289   Phrasal analysis, e.g. fini...

G06Q 10/06   Resources, workflows, human...

G06Q 10/10   Office automation; Time man...

G06Q 30/02   Marketing; Price estimation...
