Systems and methods for manipulation of inexact semi-structured data

US 8,224,830 B2
Filed: 03/17/2006
Issued: 07/17/2012
Est. Priority Date: 03/19/2005
Status: Expired due to Fees

Abstract

The data conditioning framework solution of the present invention addresses data quality issues by standardizing, verifying, matching, consolidating and merging data records using powerful inexact matching logic and search reduction technologies. The data conditioning framework uses these technologies to more efficiently condition data to improve the quality of data and/or resolve data quality issues such as incomplete, inaccurate and duplicate data records. For example, the data conditioning framework is used to “cleanse” incorrect, incomplete and duplicate data from a data source, such as an information system. The data conditioning framework uses the following approximate searching and matching techniques to improve the efficiency of the approximate matching, reduce the search space for approximate matching, and improve the speed of executing approximate searches and matches: 1) inexact trimmed matching, 2) adaptive search ordering, 3) cascading search space reduction, 4) tiered and metric indexing, and 5) domain knowledge matching.

24 Citations

19 Claims

1. A method for reducing a set of strings to approximately match to a first string by determining an edit distance between the first string and the set of strings is within a predetermined threshold, the method comprising:
- (a) receiving, by a device, a request to approximately match a first string with a set of strings using a predetermined edit distance;
  
  (b) generating, by a device, a difference histogram comprising a distribution of a difference in a first number of occurrences of each character of a character set in the first string of the request and a second number of occurrences of each character of the character set in a second string of the set of strings, by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value;
  
  (c) determining, by a device, via the difference histogram that a first sum of values across a plurality of cells of the difference histogram is greater than a predetermined threshold and that a second sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold; and
  
  (d) identifying, by the device, the second string as having an edit distance from the first string greater than the predetermined edit distance in response to the determination.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, comprising skipping an approximate match of the first string to the second string in response to the determination.
  - 3. The method of claim 1, comprising determining the first count and the second count are less than the predetermined threshold and performing an approximate match of the first string to the second string using the predetermined edit distance.
  - 4. The method of claim 1, comprising terminating execution of an approximate match of the first string with a string of the set of strings upon an edit distance of the approximate match reaching the predetermined threshold.
  - 5. The method of claim 1, wherein step (a) comprises determining the first string and the second string are the same and identifying the second string as having an edit distance of zero.
  - 6. The method of claim 1, wherein one of the first string or the second string identify one of a person, an institute or a location.
  - 7. The method of claim 1, wherein one of the first string or the second string represent data of one of the following systems:
    - accounting, manufacturing, customer relationship management, enterprise resource planning, product data management, product lifecycle management, supply chain management, bioinformatics system or lab information management system.
  - 8. The method of claim 1, wherein a character of one of the first string or the second string comprises one of a letter, a digit, or a symbol.
  - 9. The method of claim 1, comprising setting one of the predetermined edit distance or the predetermined threshold based on a percentage of number of characters of the first string.
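
The histogram test recited in steps (b) through (d) of claim 1 can be sketched as follows. This is a minimal illustration, not the patentee's implementation: the function names are hypothetical, the positive and negative increments are assumed to be +1 and -1, and the "first sum" is assumed to be the sum of the positive cells.

```python
from collections import Counter


def difference_histogram(first: str, second: str) -> Counter:
    """Per-character count in `first` minus the count in `second`."""
    hist = Counter()
    for ch in first:
        hist[ch] += 1  # increment the cell for each character of the first string
    for ch in second:
        hist[ch] -= 1  # decrement the cell for each character of the second string
    return hist


def histogram_sums(hist: Counter) -> tuple:
    """Return (sum of positive cells, sum of negative cells)."""
    positive = sum(v for v in hist.values() if v > 0)
    negative = sum(v for v in hist.values() if v < 0)
    return positive, negative


def can_skip(first: str, second: str, threshold: int) -> bool:
    """True when both sums fall outside the threshold, as worded in step (c)."""
    positive, negative = histogram_sums(difference_histogram(first, second))
    return positive > threshold and negative < -threshold


print(histogram_sums(difference_histogram("kitten", "sitting")))  # (2, -3)
print(can_skip("kitten", "sitting", 1))  # True
```

Each insertion, deletion, or substitution changes either sum by at most one, so the larger of the two magnitudes is a lower bound on the Levenshtein edit distance. Either condition on its own therefore already implies that the edit distance exceeds the threshold; the sketch requires both only to mirror the claim wording.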

10. A system for reducing a set of strings to approximately match to a first string by determining an edit distance between the first string and the set of strings is within a predetermined threshold, the system comprising:
- a device receiving a request to approximately match a first string with a set of strings using a predetermined edit distance;
  
  an approximate matching engine executing on a processor of the device, generating a difference histogram comprising a distribution of a difference in a first number of occurrences of each character of a character set in the first string of the request and a second number of occurrences of each character of the character set in a second string of the set of strings, by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value; and
  
  wherein the approximate matching engine identifies the second string as having an edit distance from the first string greater than the predetermined edit distance in response to determining that a first sum of values across a plurality of cells of the difference histogram is greater than a predetermined threshold and that a second sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. The system of claim 10, wherein the approximate matching engine skips an approximate match of the first string to the second string in response to the determination.
  - 12. The system of claim 10, wherein the approximate matching engine determines the first count and the second count are less than the predetermined threshold and performs an approximate match of the first string to the second string using the predetermined edit distance.
  - 13. The system of claim 10, wherein the approximate matching engine terminates execution of an approximate match of the first string with a string of the set of strings upon an edit distance of the approximate match reaching the predetermined threshold.
  - 14. The system of claim 10, wherein the approximate matching engine determines the first string and the second string are the same and identifies the second string as having an edit distance of zero.
  - 15. The system of claim 10, wherein one of the first string or the second string identify one of a person, an institute or a location.
  - 16. The system of claim 10, wherein one of the first string or the second string represent data of one of the following systems:
    - accounting, manufacturing, customer relationship management, enterprise resource planning, product data management, product lifecycle management, supply chain management, bioinformatics system or lab information management system.
  - 17. The system of claim 10, wherein a character of one of the first string or the second string comprises one of a letter, a digit, or a symbol.
  - 18. The system of claim 10, wherein one of the predetermined edit distance or the predetermined threshold is set based on a percentage of number of characters of the first string.
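
The early termination, identical-string shortcut, and percentage-based threshold described in the dependent claims (4, 5, 9 and 13, 14, 18) might look like the sketch below. It assumes a plain Levenshtein distance, which the claims do not name, and both function names are hypothetical. The sketch abandons the match once every cell in a row exceeds the threshold, which is one reasonable reading of "reaching the predetermined threshold".

```python
from typing import Optional


def threshold_from_percentage(first: str, percent: int) -> int:
    """Hypothetical helper: threshold as a percentage of the first string's length."""
    return (len(first) * percent) // 100


def bounded_edit_distance(first: str, second: str, threshold: int) -> Optional[int]:
    """Levenshtein distance, or None once it is certain to exceed `threshold`."""
    if first == second:
        return 0  # identical strings have an edit distance of zero
    previous = list(range(len(second) + 1))
    for i, a in enumerate(first, start=1):
        current = [i]
        for j, b in enumerate(second, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        if min(current) > threshold:
            return None  # row minimum never decreases, so stop early
        previous = current
    return previous[-1] if previous[-1] <= threshold else None


limit = threshold_from_percentage("kitten", 50)
print(limit)                                             # 3
print(bounded_edit_distance("kitten", "sitting", limit))  # 3
print(bounded_edit_distance("kitten", "sitting", 2))      # None
```

Returning `None` rather than a partial distance keeps the caller from mistaking an abandoned comparison for a measured one.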

19. A method for reducing a set of strings to approximately match to a string, the method comprising:
- (a) receiving, by a device, a request to approximately match a first string to a plurality of strings;
  
  (b) determining, by the device, a first number of occurrences of each character of a character set in the first string;
  
  (c) determining, by the device, a second number of occurrences of each character of the character set in a second string of the plurality of strings;
  
  (d) generating, by the device based on the first number of occurrences minus the second number of occurrences for each character in the character set, a difference histogram comprising a difference in occurrence of each character in the character set between the first string and the second string by incrementing each cell in the difference histogram corresponding to each character in the first string by a positive value and decrementing each cell in the difference histogram corresponding to each character set in the second string by a negative value; and
  
  (e) skipping, by the device approximately matching the first string to the second string, responsive to determining via the difference histogram that a first sum of values across a plurality of cells of the difference histogram is less than a predetermined threshold and that a sum of negative values across a second plurality of cells of the difference histogram is less than a negative of the predetermined threshold.
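
As an illustration of "reducing a set of strings" before any approximate matching is attempted, the sketch below filters a candidate list using character counts alone. It is a simplified stand-in rather than the exact conditions recited in step (e): it uses the counting lower bound on Levenshtein distance (the larger of the surplus and deficit character totals), and the function names and sample strings are hypothetical.

```python
from collections import Counter


def count_lower_bound(first: str, second: str) -> int:
    """Lower bound on the edit distance derived from character counts only."""
    a, b = Counter(first), Counter(second)
    surplus = sum((a - b).values())  # occurrences `first` has in excess
    deficit = sum((b - a).values())  # occurrences `second` has in excess
    return max(surplus, deficit)


def reduce_candidates(first: str, candidates: list, threshold: int) -> list:
    """Keep only strings that could still be within `threshold` edits of `first`."""
    return [s for s in candidates if count_lower_bound(first, s) <= threshold]


survivors = reduce_candidates("kitten", ["sitting", "mitten", "banana", "kitten"], 1)
print(survivors)  # ['mitten', 'kitten']
```

Strings that survive the filter are not guaranteed to match; they still need a full edit-distance comparison, because strings with the same characters in a different order produce an all-zero histogram.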

Current Assignee: ActivePrime, Inc.
Original Assignee: ActivePrime, Inc.
Inventors: Bidlack, Clint
Primary Examiner(s): EHICHIOYA, IRETE FRED
Application Number: US11/908,885
Publication Number: US 20090234826A1
Time in Patent Office: 2,314 Days
Field of Search: 707/609, 707/705, 707/958, 707/802
US Class Current: 707/758
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
G06F 16/24556   Aggregation; Duplicate elim...
G06F 16/90344   by using string matching te...
