System and method for performing configurable matching of similar data in a data repository

US 20070276844A1
Filed: 05/01/2006
Published: 11/29/2007
Est. Priority Date: 05/01/2006
Status: Active Grant

Abstract

Adaptive matching of similar data in a data repository to determine if two or more data items are related in accordance with configurable criteria. Matches are adapted by learning and presenting appropriate match criteria based on previous user input. The system can merge the data items into one master data item, group similar items and perform further processing based on the result. The configurable match criteria presented to a user are adapted by the system based on previous interactions of the system with users. Matching is performed by selecting data items to match, removing frequently used strings, normalizing data, tokenizing multi-word data items, assigning weights to each token, calculating a score using the assigned weights, generating groups of similar records, and assigning thresholds for match levels. Adapting choices of match criteria for a user based on past interaction allows for rapid match creation and match maintenance that optimizes data integrity across an enterprise.
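The matching steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the configuration values, the function names `tokenize` and `match_score`, and the choice to score only tokens shared by both entries are assumptions made for illustration.

```python
# Hypothetical configuration; none of these values come from the patent text.
EXCLUDED_CHARS = ".,-"                      # characters dropped before matching
FREQUENT_STRINGS = {"inc", "ltd", "corp"}   # frequently used strings to remove
NORMALIZATION = {"intl": "international"}   # token -> cleansed form
TOKEN_WEIGHTS = {"acme": 60, "international": 30, "tools": 10}
THRESHOLD = 80


def tokenize(entry: str) -> list[str]:
    """Lowercase, drop excluded characters, split on whitespace, then cleanse."""
    cleaned = "".join(ch for ch in entry.lower() if ch not in EXCLUDED_CHARS)
    tokens = [NORMALIZATION.get(t, t) for t in cleaned.split()]
    return [t for t in tokens if t not in FREQUENT_STRINGS]


def match_score(first_entry: str, second_entry: str) -> int:
    """Sum the weights of listed tokens that occur in both tokenized entries."""
    shared = set(tokenize(first_entry)) & set(tokenize(second_entry))
    return sum(TOKEN_WEIGHTS.get(t, 0) for t in shared)


score = match_score("ACME Intl. Tools, Inc.", "Acme International Corp")
is_match = score > THRESHOLD  # records are treated as similar above the threshold
```

In this sketch the two sample entries share the tokens `acme` and `international`, so the summed weight is compared against the assumed threshold of 80.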

20 Claims

1. A computer program product for adaptive matching of similar data in a data repository comprising:
- a computer usable memory medium having computer readable program code embodied therein wherein said computer readable program code comprises a matching executable unit configured to;
  
  present at least one field common to a first record and a second record wherein said at least one field is used to perform a match between said first record and said second record and wherein said at least one field is presented to a user;
  
  obtain a first selected field and a second selected field from said at least one field wherein said first selected field and said second selected field is obtained from said user;
  
  tokenize a first data entry in said first selected field for a first record to produce a first tokenized data entry;
  
  tokenize a second data entry in said second selected field for said second record to produce a second tokenized data entry;
  
  exclude at least one character from said first tokenized data entry for utilization in a match that involves said first field and said second field;
  
  exclude at least one different character with respect to said at least one character from said second tokenized data entry for utilization in a match that involves said first field and said second field;
  
  remove frequently used strings from said first tokenized data entry and from said second tokenized data entry;
  
  normalize data from said first field and from said second field to cleanse strings;
  
  accept a first list of tokens desired for a match to occur utilizing said first selected field;
  
  accept a second list of tokens desired for a match to occur utilizing said second selected field;
  
  assign weights to each token in said first list of tokens and each token in said second list of tokens;
  
  calculate a score for a match through summation of said weights for each token that occurs in said first tokenized data entry and second record and for each token that occurs in said second tokenized data entry and said second record;
  
  generate a group of similar records when said score is above a threshold; and
  
  display said group of similar records.
  - 2. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - said exclude further configured to present a list of excluded characters most often excluded in matches involving said first field.
  - 3. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - said remove said frequently used strings further configured to present a list of frequently used strings most often removed in matches involving said first field.
  - 4. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - said normalize said data from said first field further configured to present a list of tokens most often normalized in matches involving said first field.
  - 5. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - said assign said weights to each token in said first list of tokens further configured to present a list of tokens and corresponding weight values most often chosen in matches involving said first field.
  - 6. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - accept input that signifies if said first list of tokens is required to match in sequential order.
  - 7. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - accept input that signifies if said first list of tokens is required to match in non-sequential order.
  - 8. The computer program product of claim 1 wherein said computer readable program code is further configured to:
    - accept a weight value associated with a field.
  - 9. The method of claim 1 further comprising:
    - present a list of tokens previously used by a user to define a match involving said first selected field.
  - 10. The method of claim 1 further comprising:
    - alter a list of tokens presented to a user to define a match involving said first selected field when said user selects said second selected field.
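Claims 6 and 7 distinguish token lists that must match in sequential order from those that may match in non-sequential order. The following Python sketch shows one possible reading. The function names are hypothetical, and treating "sequential" as "same relative order, not necessarily adjacent" is an assumption, since the claims do not define the term.

```python
def matches_in_order(required: list[str], entry_tokens: list[str]) -> bool:
    """True if the required tokens appear in the entry in the same relative order."""
    remaining = iter(entry_tokens)
    # `token in remaining` consumes the iterator, so each later required
    # token has to be found after the position of the previous one.
    return all(token in remaining for token in required)


def matches_any_order(required: list[str], entry_tokens: list[str]) -> bool:
    """True if every required token appears somewhere in the entry."""
    return set(required) <= set(entry_tokens)


def tokens_match(required: list[str], entry_tokens: list[str], sequential: bool) -> bool:
    """Dispatch on the user's choice of sequential or non-sequential matching."""
    check = matches_in_order if sequential else matches_any_order
    return check(required, entry_tokens)


entry = ["acme", "international", "tools"]
in_order = tokens_match(["acme", "tools"], entry, sequential=True)
reversed_strict = tokens_match(["tools", "acme"], entry, sequential=True)
reversed_loose = tokens_match(["tools", "acme"], entry, sequential=False)
```

Here the reversed list fails the sequential check but passes the non-sequential one, which is the distinction the user input in claims 6 and 7 would select.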

11. A computer program product for adaptive matching of similar data in a data repository comprising:
- a computer usable memory medium having computer readable program code embodied therein wherein said computer readable program code comprises a matching executable unit configured to;
  
  present at least one field common to a first record and a second record wherein said at least one field is used to perform a match between said first record and said second record and wherein said at least one field is presented to a user;
  
  obtain a first selected field and a second selected field from said at least one field wherein said first selected field and said second selected field is obtained from said user;
  
  tokenize a first data entry in said first selected field for a first record to produce a first tokenized data entry;
  
  tokenize a second data entry in said second selected field for said second record to produce a second tokenized data entry;
  
  exclude at least one character from said first tokenized data entry for utilization in a match that involves said first field and said second field;
  
  exclude at least one different character with respect to said at least one character from said second tokenized data entry for utilization in a match that involves said first field and said second field;
  
  remove frequently used strings from said first tokenized data entry and from said second tokenized data entry;
  
  normalize data from said first field and from said second field to cleanse strings;
  
  alter a list of tokens presented to a user to define a match involving said first selected field when said user selects said second selected field;
  
  accept a first list of tokens desired for a match to occur utilizing said first selected field;
  
  accept a second list of tokens desired for a match to occur utilizing said second selected field;
  
  assign weights to each token in said first list of tokens and each token in said second list of tokens;
  
  calculate a score for a match through summation of said weights for each token that occurs in said first tokenized data entry and second record and for each token that occurs in said second tokenized data entry and said second record;
  
  generate a group of similar records when said score is above a threshold; and
  
  display said group of similar records.
  - 12. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - said exclude further configured to present a list of excluded characters most often excluded in matches involving said first field.
  - 13. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - said remove said frequently used strings further configured to present a list of frequently used strings most often removed in matches involving said first field.
  - 14. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - said normalize said data from said first field further configured to present a list of tokens most often normalized in matches involving said first field.
  - 15. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - said assign said weights to each token in said first list of tokens further configured to present a list of tokens and corresponding weight values most often chosen in matches involving said first field.
  - 16. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - accept input that signifies if said first list of tokens is required to match in sequential order.
  - 17. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - accept input that signifies if said first list of tokens is required to match in non-sequential order.
  - 18. The computer program product of claim 11 wherein said computer readable program code is further configured to:
    - accept a weight value associated with a field.
  - 19. The computer program product of claim 11 further comprising:
    - present a list of tokens previously used by a user to define a match involving said first selected field.
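Several dependent claims describe presenting the items "most often" chosen in earlier matches on a field (excluded characters, removed strings, normalized tokens, weights). A minimal Python sketch of that idea for excluded characters follows. The `MatchHistory` class, its methods, and the in-memory storage are hypothetical and only illustrate frequency-ranked suggestions.

```python
from collections import Counter, defaultdict


class MatchHistory:
    """Hypothetical store of the excluded characters chosen per field."""

    def __init__(self) -> None:
        # field name -> how often each character was excluded in past matches
        self._excluded: defaultdict[str, Counter] = defaultdict(Counter)

    def record(self, field: str, excluded_chars: str) -> None:
        """Remember the characters a user excluded when matching on `field`."""
        self._excluded[field].update(excluded_chars)

    def suggest(self, field: str, limit: int = 3) -> list[str]:
        """Return the characters most often excluded for `field`, most frequent first."""
        return [ch for ch, _ in self._excluded[field].most_common(limit)]


history = MatchHistory()
history.record("company_name", ".,")
history.record("company_name", ".-")
history.record("company_name", ".")
top_suggestion = history.suggest("company_name", limit=1)
```

The same counting approach could back the lists of frequently removed strings or commonly chosen weights, with a separate counter per kind of criterion.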

20. A system for adaptive matching of similar data in a data repository comprising:
- means for presenting at least one field common to a first record and a second record wherein said at least one field is used to perform a match between said first record and said second record and wherein said at least one field is presented to a user;
  
  means for obtaining a first selected field and a second selected field from said at least one field wherein said first selected field and said second selected field is obtained from said user;
  
  means for tokenizing a first data entry in said first selected field for a first record to produce a first tokenized data entry;
  
  means for tokenizing a second data entry in said second selected field for said second record to produce a second tokenized data entry;
  
  means for excluding at least one character from said first tokenized data entry for utilization in a match that involves said first field and said second field;
  
  means for excluding at least one different character with respect to said at least one character from said second tokenized data entry for utilization in a match that involves said first field and said second field;
  
  means for removing frequently used strings from said first tokenized data entry and from said second tokenized data entry;
  
  means for normalizing data from said first field and from said second field to cleanse strings;
  
  means for accepting a first list of tokens desired for a match to occur utilizing said first selected field;
  
  means for accepting a second list of tokens desired for a match to occur utilizing said second selected field;
  
  means for assigning weights to each token in said first list of tokens and each token in said second list of tokens;
  
  means for calculating a score for a match through summation of said weights for each token that occurs in said first tokenized data entry and second record and for each token that occurs in said second tokenized data entry and said second record;
  
  means for generating a group of similar records when said score is above a threshold; and
  
  means for displaying said group of similar records.
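The final steps of the independent claims generate a group of similar records when the score is above a threshold. One way to sketch this in Python is to score every pair of records and merge pairs that clear the threshold. The function name `group_similar`, the union-find merging, and the transitive grouping it implies are assumptions; the claims do not say how groups are assembled.

```python
from itertools import combinations


def group_similar(
    records: dict[str, set[str]], weights: dict[str, int], threshold: int
) -> list[set[str]]:
    """Group record ids whose pairwise shared-token score exceeds the threshold."""
    parent = {rid: rid for rid in records}

    def find(rid: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in combinations(records, 2):
        score = sum(weights.get(t, 0) for t in records[a] & records[b])
        if score > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, set[str]] = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    # Only groups with at least two records count as "similar records".
    return [members for members in groups.values() if len(members) > 1]


sample = {
    "r1": {"acme", "international", "tools"},
    "r2": {"acme", "international"},
    "r3": {"globex", "tools"},
}
similar = group_similar(sample, {"acme": 60, "international": 30, "tools": 10}, 80)
```

With these illustrative weights, only `r1` and `r2` score above 80, so they form the single group that would be displayed.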

Current Assignee: SAP SE
Original Assignee: SAP AG (SAP SE)
Inventors: Cohen, Ronen; Segal, Anat
Granted Patent: US 7,542,973 B2
US Class Current: 1/1
CPC Class Codes:
G06F 16/24558   Binary matching operations
G06F 16/2458   Special types of queries, e...
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
Y10S 707/99943   Generating database or data...
Y10S 707/99945   Object-oriented database st...
