Data de-duplication
Abstract
Generating masks for de-duplication in a database where distributed entities provide activity data for said database. Determining from activity input data which entities add variable data to a given data field. Generating a list of the masks which effectively remove the variable data portion in the field. Consolidating input data using the generated masks.
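The consolidation step described in the abstract can be illustrated with a minimal sketch. It assumes that a mask is simply a literal leading substring of the name field and that the longest matching mask wins; neither detail is stated in the abstract. The function names `apply_masks` and `consolidate` and the sample records are hypothetical.

```python
def apply_masks(name, masks):
    """Return the longest mask that is a prefix of `name`, or `name` itself.

    Assumption: a mask is a literal leading substring; the variable
    remainder of the field is ignored.
    """
    matches = [m for m in masks if name.startswith(m)]
    return max(matches, key=len) if matches else name


def consolidate(records, masks):
    """Group (name, location) records under their masked entity key."""
    groups = {}
    for name, location in records:
        groups.setdefault(apply_masks(name, masks), []).append((name, location))
    return groups


# Hypothetical activity data: two records differ only in a variable tail.
records = [
    ("ACME STORE 0012", "Austin"),
    ("ACME STORE 0457", "Boise"),
    ("BOBS DINER", "Reno"),
]
groups = consolidate(records, {"ACME STORE"})
```

Here both "ACME STORE" records land under one key, while the unmasked record keeps its full name as its key.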
21 Claims
1. A processor-implemented method for generating masks for data de-duplication from entity eponym data fields in a given set of data records, said data records each including an entity location data field, the method comprising:
- for each data record, splitting each entity eponym data field into a corresponding prefix-suffix combination, and for each prefix, a processor computing a tally of distinct entity locations, and for each prefix and entity location combination, the processor computing a tally of distinct suffixes; and
- setting, by the processor, a threshold boundary wherein a prefix is defined as one of said masks when one or more of the tallies are indicative of different eponyms signifying a particular entity, wherein the one mask enables a particular data record to be matched to the particular entity by ignoring a portion of the particular data record, wherein said de-duplication involves matching each data record representing a specific activity to the particular entity of a plurality of known entities such that duplication of entities is reduced in a database of said plurality of known entities.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
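The two tallies and the threshold step of claim 1 can be sketched roughly as follows. The claim does not say how a name is split or what the threshold values are, so this sketch assumes a split at the last space and two hypothetical cut-offs, `min_locations` and `min_suffixes`; a prefix becomes a mask when either tally reaches its cut-off. All names and sample records are illustrative only.

```python
from collections import defaultdict


def split_name(name):
    """Assumed heuristic: split at the last space into (prefix, suffix)."""
    prefix, _, suffix = name.rpartition(" ")
    return (prefix, suffix) if prefix else (name, "")


def generate_masks(records, min_locations=3, min_suffixes=3):
    """Return prefixes whose tallies suggest many names for one entity."""
    locations = defaultdict(set)  # prefix -> distinct locations
    suffixes = defaultdict(set)   # (prefix, location) -> distinct suffixes
    for name, location in records:
        prefix, suffix = split_name(name)
        if not suffix:
            continue  # single-word names offer no prefix-suffix combination
        locations[prefix].add(location)
        suffixes[(prefix, location)].add(suffix)

    masks = set()
    for prefix, locs in locations.items():
        most_suffixes = max(len(suffixes[(prefix, loc)]) for loc in locs)
        # Hypothetical thresholds: either tally alone may qualify the prefix.
        if len(locs) >= min_locations or most_suffixes >= min_suffixes:
            masks.add(prefix)
    return masks


records = [
    ("ACME STORE 001", "Austin"),
    ("ACME STORE 002", "Austin"),
    ("ACME STORE 003", "Austin"),
    ("BOBS DINER", "Reno"),
]
masks = generate_masks(records)
```

With these sample records, "ACME STORE" shows three distinct suffixes at one location and is selected, while "BOBS" shows only one suffix at one location and is not.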
12. A processor-implemented method for partitioning a plurality of data packets in a database such that duplication of data groups is minimized, the method comprising:
- selecting a primary identifier data field and a secondary identifier data field for each data packet that represents a corresponding activity;
- for all data packets having a non-unique primary identifier data field, using heuristic procedures for splitting each primary identifier data into at least one prefix-suffix combination;
- for each prefix, counting a first tally of how many distinct secondary identifier data fields occurs, and counting a second tally of how many distinct secondary identifier data fields occur with a single suffix, and for each prefix and each secondary identifier data field matched thereto, counting a third tally of how many distinct suffixes occur;
- based on said first tally, said second tally and said third tally generating masks representative of prefixes applicable to said data packets having a non-unique primary identifier data field such that application of said masks assigns data packets having a non-unique primary identifier data field to associated common entities defined thereby, wherein application of said masks provides cleaning of the data packets; and
- filing each of said data packets into a single file assigned to respective said associated common entities defined.
Dependent claims: 13, 14, 15, 16
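The three tallies recited in claim 12 can be sketched as below. This is one possible reading, not the patented procedure: it assumes the "heuristic procedures" split a primary identifier at every whitespace boundary, and it reads the second tally as the number of secondary identifiers seen with exactly one suffix for a given prefix. The names `candidate_splits` and `compute_tallies` are hypothetical.

```python
from collections import defaultdict


def candidate_splits(primary):
    """Assumed heuristic: one (prefix, suffix) pair per whitespace boundary."""
    words = primary.split()
    return [(" ".join(words[:i]), " ".join(words[i:]))
            for i in range(1, len(words))]


def compute_tallies(packets):
    """Return (first, second, third) tallies for (primary, secondary) packets."""
    suffixes = defaultdict(set)  # (prefix, secondary) -> distinct suffixes
    for primary, secondary in packets:
        for prefix, suffix in candidate_splits(primary):
            suffixes[(prefix, secondary)].add(suffix)

    first = defaultdict(int)   # prefix -> distinct secondary identifiers
    second = defaultdict(int)  # prefix -> secondaries seen with one suffix
    third = {}                 # (prefix, secondary) -> distinct suffixes
    for (prefix, secondary), seen in suffixes.items():
        first[prefix] += 1
        if len(seen) == 1:
            second[prefix] += 1
        third[(prefix, secondary)] = len(seen)
    return dict(first), dict(second), third


packets = [
    ("ACME STORE 001", "Austin"),
    ("ACME STORE 002", "Austin"),
    ("ACME STORE 003", "Boise"),
]
first, second, third = compute_tallies(packets)
```

How the three tallies are combined into a mask decision is left open here, since the claim only states that masks are generated "based on" them.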
17. A processor-implemented method of data de-duplication comprising:
- receiving, by a processor, a periodic log of transactions representing corresponding activities associated with entities, each transaction represented by a data string including at least a name field and another identifier field;
- selecting, by the processor, unique representative samples of said transactions;
- for each of said samples, the processor dissecting each name field into a corresponding prefix and suffix combination, and for each prefix and each another identifier combination, the processor counting a number of distinct suffixes and storing a tally therefor; and
- generating, by the processor, a mask from a specific prefix when the specific prefix meets a predefined decision criteria which is a function of said tally, wherein the mask is applicable to the log of transactions to enable at least some of the data strings to be matched to a particular entity name by ignoring variable portions of the at least some data strings such that duplication of entities is reduced.
Dependent claims: 18, 19, 20
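The sampling and decision steps of claim 17 might look like the following sketch. It assumes each transaction is already parsed into a `(name, other_id)` tuple, that "unique representative samples" means dropping exact repeats, and that the decision criterion is a caller-supplied function of the suffix tally (defaulting to a hypothetical cut-off of three). None of these choices comes from the claim itself.

```python
from collections import defaultdict


def unique_samples(log):
    """Drop exact repeat transactions, keeping first-seen order."""
    return list(dict.fromkeys(log))


def masks_from_log(log, criterion=lambda tally: tally >= 3):
    """Return prefixes whose distinct-suffix tally satisfies `criterion`."""
    tallies = defaultdict(set)  # (prefix, other_id) -> distinct suffixes
    for name, other_id in unique_samples(log):
        # Assumed dissection: split the name field at its last space.
        prefix, _, suffix = name.rpartition(" ")
        if prefix:
            tallies[(prefix, other_id)].add(suffix)
    return {prefix for (prefix, _), seen in tallies.items()
            if criterion(len(seen))}


# Hypothetical log: one transaction repeats five times.
log = [("ACME STORE 001", "Z1")] * 5 + [
    ("ACME STORE 002", "Z1"),
    ("ACME STORE 003", "Z1"),
    ("BOBS DINER", "Z2"),
]
masks = masks_from_log(log)
```

Because suffixes are stored in sets, repeats would not inflate the tally even without the sampling step; in this sketch the step mainly reduces the number of records to process.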
21. A computer memory containing instructions that when executed cause a computer to:
- store a given set of data records representing activities for a given set of entities, each of said data records having discrete data fields including an entity identification field and an entity location field;
- split each entity identification field into a corresponding prefix-suffix combination;
- for each prefix, compute a tally of distinct entity locations;
- for each prefix and entity location field combination, compute a tally of distinct suffixes therefor;
- set a threshold boundary wherein a prefix is defined as one of said masks when one or more of the tallies is indicative of different entity identification strings in entity identification fields signifying a single one of said entities; and
- apply said masks to said given set of data records such that each record is assigned to a corresponding one of said given entities, wherein applying the masks provides cleaning of the data records.
Specification