Data de-duplication

US 7,200,604 B2
Filed: 02/17/2004
Issued: 04/03/2007
Est. Priority Date: 02/17/2004
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Generating masks for de-duplication in a database where distributed entities provide activity data for said database. Determining from activity input data which entities add variable data to a given data field. Generating a list of the masks which effectively remove the variable data portion in the field. Consolidating input data using the generated masks.

160 Citations

21 Claims

1. A processor-implemented method for generating masks for data de-duplication from entity eponym data fields in a given set of data records, said data records each including an entity location data field, the method comprising:
- for each data record, splitting each entity eponym data field into a corresponding prefix-suffix combination, and for each prefix, a processor computing a tally of distinct entity locations, and for each prefix and entity location combination, the processor computing a tally of distinct suffixes; and
  
  setting, by the processor, a threshold boundary wherein a prefix is defined as one of said masks when one or more of the tallies are indicative of different eponyms signifying a particular entity, wherein the one mask enables a particular data record to be matched to the particular entity by ignoring a portion of the particular data record, wherein said de-duplication involves matching each data record representing a specific activity to the particular entity of a plurality of known entities such that duplication of entities is reduced in a database of said plurality of known entities.
- - 2. The method as set forth in claim 1, said setting the threshold boundary further comprising:
    - setting the threshold boundary wherein the prefix is defined as the one of said masks when one or more of the tallies indicate said entity eponym data fields include variable data.
  - 3. The method as set forth in claim 1, said setting the threshold boundary further comprising:
    - setting the threshold boundary wherein the prefix is defined as the one of said masks when the tally of distinct suffixes is indicative of suffixes being information other than entity identity.
  - 4. The method as set forth in claim 1, said setting the threshold boundary further comprising:
    - setting the threshold boundary where a ratio of the tally for said distinct suffixes to the tally for distinct entity locations is indicative of information other than entity identity.
  - 5. The method as set forth in claim 1 further comprising:
    - applying an override function to ignore the one mask based on a characteristic of a data record.
  - 6. The method as set forth in claim 1 further comprising:
    - prior to said splitting, creating a reduced data records sub-set by eliminating records having a unique entity eponym and entity location data pair.
  - 7. The method as set forth in claim 1 further comprising:
    - generating a display showing a graph having points each representing a pair of a prefix and entity location as a function of a number of distinct suffixes and a number of distinct entity locations.
  - 8. The method as set forth in claim 1 wherein said masks are generated as rules for ignoring variable data portions of the entity eponym data fields and assigning a respective data record therefor to said database based on a non-variable data portion of the corresponding entity eponym data field.
  - 9. The method as set forth in claim 8 further comprising:
    - maintaining said database by periodic application of said rules to a different set of data records to be added to said database.
  - 10. The method as set forth in claim 1, wherein the data records comprise business transaction records, and wherein the particular entity comprises a merchant.
  - 11. The method as set forth in claim 1, further comprising applying the one mask made up of the prefix to a new set of data records to assign at least some of the new set of data records to the particular entity.
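
The tallying and thresholding recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The split rule (cut at the last space), the `min_suffixes` threshold, and all function names are assumptions made for illustration. Claim 1 permits a decision based on "one or more of the tallies", and this sketch uses only the per-location suffix tally.

```python
from collections import defaultdict


def split_name(name):
    """Split a name field into (prefix, suffix) at the last space.

    Hypothetical rule: the claims do not fix how the split is made.
    """
    prefix, sep, suffix = name.rpartition(" ")
    return (prefix, suffix) if sep else (name, "")


def tally(records):
    """Compute the two tallies of claim 1 from (name, location) records."""
    locations = defaultdict(set)  # prefix -> distinct locations
    suffixes = defaultdict(set)   # (prefix, location) -> distinct suffixes
    for name, location in records:
        prefix, suffix = split_name(name)
        locations[prefix].add(location)
        suffixes[(prefix, location)].add(suffix)
    return locations, suffixes


def generate_masks(records, min_suffixes=3):
    """Treat a prefix as a mask when one location shows many distinct suffixes.

    Assumed criterion: many different suffixes at a single location suggest
    the suffix is variable data rather than part of the entity's identity.
    """
    _locations, suffixes = tally(records)
    return {
        prefix
        for (prefix, _location), seen in suffixes.items()
        if len(seen) >= min_suffixes
    }
```

With three records that share the prefix `PAYCO TXN` at one location but carry different suffixes, `generate_masks` would report `PAYCO TXN` as a mask, while a prefix that appears with a single suffix per location would not be reported.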

12. A processor-implemented method for partitioning a plurality of data packets in a database such that duplication of data groups is minimized, the method comprising:
- selecting a primary identifier data field and a secondary identifier data field for each data packet that represents a corresponding activity;
  
  for all data packets having a non-unique primary identifier data field, using heuristic procedures for splitting each primary identifier data into at least one prefix-suffix combination;
  
  for each prefix, counting a first tally of how many distinct secondary identifier data fields occurs, and counting a second tally of how many distinct secondary identifier data fields occur with a single suffix, and for each prefix and each secondary identifier data field matched thereto, counting a third tally of how many distinct suffixes occur;
  
  based on said first tally, said second tally and said third tally generating masks representative of prefixes applicable to said data packets having a non-unique primary identifier data field such that application of said masks assigns data packets having a non-unique primary identifier data field to associated common entities defined thereby, wherein application of said masks provides cleaning of the data packets; and
  
  filing each of said data packets into a single file assigned to respective said associated common entities defined.
- - 13. The method as set forth in claim 12 wherein said primary identifier data field is an intended unique entity name data field.
  - 14. The method as set forth in claim 12 wherein said masks are generated to merge common entity name prefixes.
  - 15. The method as set forth in claim 12 wherein said secondary identifier data field is a postal code data field.
  - 16. The method as set forth in claim 12 further comprising:
    - retaining said masks as rules for cleaning dirty data portions of a data field of each data packet by removing variable data segments therefrom.
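
The three tallies recited in claim 12 can be sketched as follows. This is an illustration under stated assumptions, not the claimed heuristic. Splitting at every whitespace boundary stands in for the unspecified "heuristic procedures", the filter for non-unique primary identifiers is omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def candidate_splits(name):
    """Yield (prefix, suffix) pairs at every whitespace boundary.

    Hypothetical stand-in for the claim's "heuristic procedures".
    """
    tokens = name.split()
    for i in range(1, len(tokens)):
        yield " ".join(tokens[:i]), " ".join(tokens[i:])


def three_tallies(packets):
    """Return (first, second, third) tallies from (primary, secondary) pairs.

    first[prefix]              distinct secondary identifiers seen with prefix
    second[prefix]             of those, how many occur with a single suffix
    third[(prefix, secondary)] distinct suffixes for that combination
    """
    suffixes_at = defaultdict(set)
    for primary, secondary in packets:
        for prefix, suffix in candidate_splits(primary):
            suffixes_at[(prefix, secondary)].add(suffix)

    first = defaultdict(int)
    second = defaultdict(int)
    third = {}
    for (prefix, secondary), seen in suffixes_at.items():
        first[prefix] += 1
        second[prefix] += 1 if len(seen) == 1 else 0
        third[(prefix, secondary)] = len(seen)
    return dict(first), dict(second), third
```

A mask-generation rule would then be some function of these three dictionaries. The claim leaves that function open, so none is shown here.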

17. A processor-implemented method of data de-duplication comprising:
- receiving, by a processor, a periodic log of transactions representing corresponding activities associated with entities, each transaction represented by a data string including at least a name field and another identifier field;
  
  selecting, by the processor, unique representative samples of said transactions;
  
  for each of said samples, the processor dissecting each name field into a corresponding prefix and suffix combination, and for each prefix and each another identifier combination, the processor counting a number of distinct suffixes and storing a tally therefor; and
  
  generating, by the processor, a mask from a specific prefix when the specific prefix meets a predefined decision criteria which is a function of said tally, wherein the mask is applicable to the log of transactions to enable at least some of the data strings to be matched to a particular entity name by ignoring variable portions of the at least some data strings such that duplication of entities is reduced.
- - 18. The method as set forth in claim 17 wherein for each said prefix, counting prefix-another identifier combinations and storing a first tally therefor and counting prefix-distinct another identifier combinations and storing a second tally therefor, such that said predefined decision criteria is a function of said tallies.
  - 19. The method as set forth in claim 17, wherein the transactions comprise business transactions, and the entity name is a name of a merchant.
  - 20. The method as set forth in claim 19, further comprising the processor applying the mask to the data strings to consolidate transactions associated with the merchant.
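
Claims 17 and 20 describe a flow from a transaction log to consolidated merchant records, which the following sketch walks through end to end. Several points are assumptions rather than anything the claims specify: "unique representative samples" is read as the set of distinct (name, identifier) pairs, the name is dissected at its last space, and the default decision criterion (at least two distinct suffixes) is arbitrary. All names are hypothetical.

```python
from collections import Counter, defaultdict


def dissect(name):
    """Split a name field into (prefix, suffix) at the last space (assumed rule)."""
    head, sep, tail = name.rpartition(" ")
    return (head, tail) if sep else (name, "")


def masks_from_log(log, criterion=lambda tally: tally >= 2):
    """Generate masks from a log of (name, other_id) transactions."""
    samples = set(log)  # repeated transactions must not inflate the tallies
    tallies = defaultdict(set)  # (prefix, other_id) -> distinct suffixes
    for name, other_id in samples:
        prefix, suffix = dissect(name)
        tallies[(prefix, other_id)].add(suffix)
    return {
        prefix
        for (prefix, _other_id), seen in tallies.items()
        if criterion(len(seen))
    }


def consolidate(log, masks):
    """Count transactions per merchant, ignoring the suffix of masked names."""
    counts = Counter()
    for name, _other_id in log:
        prefix, _suffix = dissect(name)
        counts[prefix if prefix in masks else name] += 1
    return counts
```

Passing the criterion as a callable keeps the sketch neutral about the "predefined decision criteria", which claim 17 only requires to be a function of the tally.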

21. A computer memory containing instructions that when executed cause a computer to:
- store a given set of data records representing activities for a given set of entities, each of said data records having discrete data fields including an entity identification field and an entity location field;
  
  split each entity identification field into a corresponding prefix-suffix combination;
  
  for each prefix, compute a tally of distinct entity locations;
  
  for each prefix and entity location field combination, compute a tally of distinct suffixes therefor;
  
  set a threshold boundary wherein a prefix is defined as one of said masks when one or more of the tallies is indicative of different entity identification strings in entity identification fields signifying a single one of said entities; and
  
  apply said masks to said given set of data records such that each record is assigned to a corresponding one of said given entities, wherein applying the masks provides cleaning of the data records.

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Zhang, Bin; Forman, George Henry; Safai, Fereydoon
Primary Examiner(s): LEWIS, CHERYL RENEA
Application Number: US10/780,235
Publication Number: US 20050182780A1
Time in Patent Office: 1,141 Days
Field of Search: 707/6, 707/100, 707/101, 707/200, 707/202, 707/206, 705/64, 705/75, 705/38-39
US Class Current: 707/692
CPC Class Codes:
G06F 16/215   Improving data quality; Dat...
G06F 16/24556   Aggregation; Duplicate elim...
Y10S 707/99942   Manipulating data structure...
