Automatically mining patterns for rule based data standardization systems

  • US 10,095,780 B2
  • Filed: 02/07/2017
  • Issued: 10/09/2018
  • Est. Priority Date: 03/07/2012
  • Status: Active Grant
First Claim

1. A system for mining for sub-patterns within a text data set, the system comprising:

  • a data source to store the text data set; and

    a processor configured with logic to:

    find a set of N frequently occurring sub-patterns within the text data set;

    extract the N sub-patterns from the data set; and

    cluster pairs of the extracted sub-patterns into K groups such that each pair of the extracted sub-patterns is placed within the same group with other pairs of the extracted sub-patterns based upon a respective distance value D of each pair of the extracted sub-patterns that determines a degree of similarity based upon a respective longest common substring between a respective first sub-pattern and a respective second sub-pattern of the each pair of the extracted sub-patterns within the same group and also based upon values associated with characters or symbols for the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns within the same group, wherein:

    the value of each of the characters or the symbols is related to a respective probability value of each of the characters or the symbols in the text data set; and

    the respective distance D of the each pair of the extracted sub-patterns is related to a sum of the values of each of the characters or the symbols of the respective first sub-pattern, a sum of the values of each of the characters or the symbols of the respective second sub-pattern, and a sum of the values of each of the characters or the symbols of a longest common substring between the respective first sub-pattern and the respective second sub-pattern of the each pair of the extracted sub-patterns.
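
The claim says only that D is "related to" three sums and that each symbol's value is "related to" its probability; it does not give a formula. The sketch below is one possible reading, not the patented method. It assumes each symbol's value is -log(p), where p is the symbol's relative frequency in the data, and it assumes D = 1 - 2·W(LCS) / (W(first) + W(second)), where W sums the symbol values. The function names (`symbol_weights`, `longest_common_substring`, `distance`) and the sample patterns are hypothetical.

```python
import math
from collections import Counter


def symbol_weights(patterns):
    """Assumed weighting: value of a symbol = -log(relative frequency)."""
    counts = Counter(ch for p in patterns for ch in p)
    total = sum(counts.values())
    return {ch: -math.log(n / total) for ch, n in counts.items()}


def longest_common_substring(a, b):
    """Longest contiguous run shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def weight(s, w):
    """Sum of the symbol values of s."""
    return sum(w[ch] for ch in s)


def distance(a, b, w):
    """Assumed form of D: 0 for identical patterns, 1 when nothing is shared."""
    total = weight(a, w) + weight(b, w)
    if total == 0:
        return 0.0  # degenerate case: every symbol has probability 1
    lcs = longest_common_substring(a, b)
    return 1.0 - 2.0 * weight(lcs, w) / total


# Hypothetical sub-patterns, written as strings of symbol classes.
patterns = ["AAB", "AAC", "DD"]
w = symbol_weights(patterns)
d_similar = distance("AAB", "AAC", w)
d_disjoint = distance("AAB", "DD", w)
```

Under these assumptions, rare symbols carry more weight than common ones, so two sub-patterns that share an unusual run of symbols end up closer together than two that share only frequent ones. The resulting pairwise distances could then be passed to any clustering routine that forms K groups; the claim does not name one. Note that the sketch picks the longest common substring by length, which is not necessarily the one with the largest summed value.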
