Automatically mining patterns for rule based data standardization systems

US 10,163,063 B2
Filed: 03/07/2012
Issued: 12/25/2018
Est. Priority Date: 03/07/2012
Status: Active Grant

Abstract

Computer program products and systems are provided for mining for sub-patterns within a text data set. The embodiments facilitate finding a set of N frequently occurring sub-patterns within the data set, extracting the N sub-patterns from the data set, and clustering the extracted sub-patterns into K groups, where each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group.
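As a rough illustration of the first two steps named in the abstract (finding and extracting the N most frequent sub-patterns), the following sketch counts contiguous token runs across records. It is only one plausible reading: the whitespace tokenization, the `min_len` and `max_len` bounds, and the function name `frequent_subpatterns` are assumptions made for this example and are not taken from the patent.

```python
from collections import Counter


def frequent_subpatterns(records, n_top, min_len=2, max_len=3):
    """Return the n_top most frequent contiguous token runs across records.

    Assumption: a "sub-pattern" is modeled as a run of min_len..max_len
    whitespace-separated tokens; the patent may define it differently.
    """
    counts = Counter()
    for record in records:
        tokens = record.split()
        for size in range(min_len, max_len + 1):
            for start in range(len(tokens) - size + 1):
                counts[tuple(tokens[start:start + size])] += 1
    # Rank by descending count; break ties lexicographically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(sub) for sub, _ in ranked[:n_top]]
```

Calling `frequent_subpatterns(["a b c", "a b d", "x a b"], 1)` would pick out the run `a b`, which appears in all three hypothetical records.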

2 Claims

1. A system for mining sub-patterns within a text data set, the system comprising:
- a data source to store the text data set; and
  
  a processor configured with logic to:
  
  find a set of N frequently occurring sub-patterns within the data set;
  
  extract the N sub-patterns from the data set; and
  
  cluster the extracted sub-patterns into K groups such that each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity based upon a longest common substring between the sub-pattern and every other sub-pattern within the same group and also based upon values associated with characters or symbols for the sub-pattern and every other sub-pattern within the same group;
  
  wherein the processor is configured to determine the distance value D between any two sub-patterns s₁ and s₂ of the N sub-patterns based upon the following equation:
  
  [equation not reproduced in the source text]

2. A computer program product for mining for sub-patterns within a text data set, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
  
  find a set of N frequently occurring sub-patterns within the data set;
  
  extract the N sub-patterns from the data set; and
  
  cluster the extracted sub-patterns into K groups such that each extracted sub-pattern is placed within the same group with other extracted sub-patterns based upon a distance value D that determines a degree of similarity between the sub-pattern and every other sub-pattern within the same group;
  
  wherein the computer readable program code is configured to calculate the distance value D between any two sub-patterns s₁ and s₂ of the N sub-patterns based upon the following equation:
  
  [equation not reproduced in the source text]
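The clustering step in the claims can be sketched as follows. Because the claimed equation for D is not available in this text, the `distance` function below is a stand-in chosen for illustration: it uses only the longest common substring, normalized by the longer string's length, and it omits the "values associated with characters or symbols" that claim 1 also mentions. The complete-linkage merging strategy and all function names are likewise assumptions, not the patented method.

```python
from itertools import combinations


def longest_common_substring_len(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def distance(s1, s2):
    """Stand-in for D (NOT the claimed equation): 0.0 means identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 0.0
    return 1.0 - longest_common_substring_len(s1, s2) / longest


def cluster(subpatterns, k):
    """Merge singleton groups until k remain, using complete linkage.

    Complete linkage scores a pair of groups by their most distant
    members, so every member ends up close to every other member of
    its group.
    """
    clusters = [[s] for s in subpatterns]

    def linkage(c1, c2):
        return max(distance(a, b) for a in c1 for b in c2)

    while len(clusters) > max(k, 1):
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees i < j
    return clusters
```

With the hypothetical inputs `["AAB", "AAC", "XYZ", "XYW"]` and `k=2`, the two strings sharing `AA` and the two sharing `XY` end up in separate groups. Complete linkage was picked here because the claims compare each sub-pattern with every other member of its group; other linkage rules would be equally easy to swap in.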

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Chaturvedi, Snigdha; Faruquie, Tanveer A.; Karanam, Hima P.; Mendelssohn, Marvin; Mohania, Mukesh K.; Subramaniam, L. Venkata
Primary Examiner(s)
Arjomandi, Noosha

Application Number

US13/414,374
Publication Number

US 20130238610A1
Time in Patent Office

2,484 Days
Field of Search

707/737, 707/738, 707/739, 707/740, 707/727, 707/728
US Class Current
CPC Class Codes

G06F 16/334   Query execution G06F16/335 ...

G06F 16/35   Clustering; Classification

G06F 2216/03   Data mining

G06F 40/289   Phrasal analysis, e.g. fini...

G06Q 10/06   Resources, workflows, human...

G06Q 10/10   Office automation; Time man...

G06Q 30/02   Marketing; Price estimation...
