Grouping and duplicate removal method in a database
First Claim
1. A grouping method comprising the steps of:
- determining whether there is possibility or not for existence of at least another record having the identical value of a column for grouping of the relevant record for each record by executing a predetermined first process to a value of at least a part of one or more predetermined columns for the grouping of each record of a plurality of records of a record list included in the database;
determining respectively a part of a plurality of records which is determined, by said determination, not to have possibility for existence of other records having a value of the same grouping column as a records belonging to one group; and
classifying, by a second process, a plurality of the other records other than a part of said records among a plurality of records of said record list into a plurality of groups so that a plurality of records having the identical value of said columns for grouping belong to the same group.
1 Assignment
0 Petitions
Accused Products
Abstract
In order to realize high speed process for grouping the records having the identical values of one or more columns of the input list, the input list is canned, a hash value is generated using a hash function in which a value of the column as the non-vacant partial aggregation of the columns for the grouping is used as the argument, and it is determined whether two or more records having the hash values exist or not. The input list is scanned again and the calculating process of the aggregation columns is immediately executed for the records having the hash value for which it is determined by the first scanning that there is only one record having the identical hash value, the result of such calculating process is output and the records are defined as the input of the ordinary grouping process for the records which are determined to exist as the two or more records.
-
Citations
18 Claims
-
1. A grouping method comprising the steps of:
-
determining whether there is possibility or not for existence of at least another record having the identical value of a column for grouping of the relevant record for each record by executing a predetermined first process to a value of at least a part of one or more predetermined columns for the grouping of each record of a plurality of records of a record list included in the database;
determining respectively a part of a plurality of records which is determined, by said determination, not to have possibility for existence of other records having a value of the same grouping column as a records belonging to one group; and
classifying, by a second process, a plurality of the other records other than a part of said records among a plurality of records of said record list into a plurality of groups so that a plurality of records having the identical value of said columns for grouping belong to the same group. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
generating a hash value, to each record among a plurality of records of the record list included in said database, by the hash function in which a value of at least a part of columns of one or more predetermined grouping columns of each record is used as an argument; and
determining whether there is at least another records having the same hash value as that generated for each record or not depending on a plurality of hash values generated for a plurality of said records;
wherein said determining step further comprises a step of determining that a plurality of partial records which are determined respectively not to allow the other records having the same hash value to exist by said determination as the records respectively belonging to one group.
-
-
3. A grouping method according to claim 2, wherein said determining step comprises the steps of:
-
updating, in relation to the generated hash value, counted data for identifying, each time when said hash value is generated for each record, whether the number of records which has generated a plurality of hash value to be generated by said hash function is 0 or 1 or 2 or larger;
executing again said hash value generating step to a plurality of said records after said update is executed to a plurality of said records of said record list; and
determining, based on said counted data, whether the number of records having generated the hash value generated again is 1 or not.
-
-
4. A grouping method according to claim 3, wherein said counted data includes two bits for counting whether the number of records having generated such hash value is 0 or 1 or 2 or larger corresponding to each value among a plurality of hash values which may be generated by said hash function.
-
5. A grouping method according to claim 3, wherein said counted data expresses the number of records 0, 1, 2 or larger having generated the hash value with the natural numbers such as 0×
- (3i), 1×
(3i), 2×
(3i) when the hash value is the i-th hash value corresponding to each value of a plurality of hash values which may be generated by said hash function and a binary number expressing the total sum of such natural numbers corresponding to a plurality of said hash values; and3n is smaller than 22k and n>
k when the total value of a plurality of said hash values generated by said hash function is n and the number of bits of said binary number is 2 k bits.
- (3i), 1×
-
6. A grouping method according to claim 1, further comprising a step of determining whether an average group duplicate degree of said record list is near to 1 or not, and wherein the steps after said determining step are executed when the average group duplicate degree of said record list is determined to be near to 1.
-
7. A grouping method according to claim 2, further comprising a step of executing an aggregating process for a group to which records belong without depending on the execution result of said classifying step on the basis of each record among a plurality of records of a part respectively determined by said determining step that there is no other records having the same hash value;
- and
wherein the classifying step includes a step of executing the aggregating process to each group among a plurality of groups classified by the classifying step.
- and
-
8. A recording medium having recorded a computer program for executing said method of claim 1 by using a computer.
-
9. A duplicate removal method comprising the steps of:
-
determining whether there is possibility or not for existence of at least another record having a value identical to that of a record by executing a predetermined first process to one or more column values of a plurality of records of a record list included in a database;
executing a second process for deleting records other than one of the duplicate records having the identical value to a plurality of records other than a plurality of partial records determined to have no possibility for existence of the other records having the same value by said determining step among a plurality of records of said record list; and
a plurality of said partial records and a plurality of said records after said deletion are output as the records after duplicate removal. - View Dependent Claims (10, 11, 12, 13, 14, 15)
generating a hash value with the hash function using the values of one or more columns of a record as the argument to each record among a plurality of records of the record list included in the database;
determining whether at least another record having the same hash value as that generated for each record exists or not depending on a plurality of hash values generated for a plurality of said records; and
determining a plurality of partial records, among a plurality of records of said record list, determined by said determining step that there is no other records having the identical hash value as the records having no possibility for existence of the other records having the identical value.
-
-
11. A duplicate removal method according to claim 10, wherein said determining step comprising the steps of:
-
updating, in relation to the generated hash value, counted data for identifying, each time when said hash value is generated for each record, whether the number of records which has generated a plurality of hash value to be generated by said hash function is 0 or 1 or 2 or larger;
executing again said hash value generating step to a plurality of said records after said update is executed to a plurality of said records of said record list; and
determining, based on said counted data, whether the number of records having generated the hash value generated again is 1 or not.
-
-
12. A duplicate removal method according to claim 11, wherein said counted data includes two bits for counting whether the number of records having generated such hash value is 0 or 1 or 2 or larger corresponding to each value among a plurality of hash values which may be generated by said hash function.
-
13. A duplicate removal method according to claim 11, wherein
said counted data expresses the number of records 0, 1, 2 or larger having generated the hash value with the natural numbers such as 0× - (3i), 1×
(3i), 2×
(3i) when the hash value is the i-th hash value corresponding to each value of a plurality of hash values which may be generated by said hash function and a binary number expressing the total sum of such natural numbers corresponding to a plurality of said hash values; and
3n is smaller than 22k and n>
k when the total value of a plurality of said hash values generated by said hash function is n and the number of bits of said binary number is 2 k bits.
- (3i), 1×
-
14. A duplicate removal method claimed in claim 9, further comprising a step of determining whether the average group duplicate degree of said record list is near to 1 or not is further comprised, and wherein the steps after said step of determining the possibility are executed when the average group duplicate degree of said record list is determined to be near to 1.
-
15. A recording medium having recorded a computer program to execute the method of claim 9 by using a computer.
-
16. A grouping method comprising the steps of:
-
generating hash value to each record among a plurality of records of record list included in a data base with a hash function using, as an argument, a value of at least a part of columns of one or more predetermined grouping columns of such records;
determining whether a value of an effective grouping column and aggregation data are stored or not in one entry corresponding to said hash value among a plurality of entries in a prepared aggregating area;
calculating the aggregation data when a record is only one record belonging to one group based on data of predetermined aggregating column of said record when said one entry does not store a value of the effective grouping column and the aggregation data;
storing a value of the grouping column of said record and said calculated aggregation data in said one entry;
determining whether the values of the stored grouping column is identical to the value of grouping column of each record having generated said hash value when said one entry has already stored the value of the grouping column and aggregation data;
updating, when the value of said grouping column stored in one entry is identical to the value of grouping column of each record having generated said hash value, said aggregation data stored in said entry to the data obtained when each record having generated said hash value is added to the group corresponding to said one entry based on the data of said aggregation column of each record having generated said hash value;
storing the data of grouping column of each record and data of aggregation column as the records not grouped when the value of the grouping column stored in said entry is different from the value of the grouping column of each record having generated said hash value;
outputting, to a plurality of records of said record list, data stored in a plurality of entries of said aggregating area as data in regard to one group after execution of the steps from said generating step to said registering step; and
changing a hash function used in said generating step during the repetition of the steps by repeating the steps from said generating step to said output step to a plurality of records stored as the records not grouped in said storing step until the grouping is made for all records. - View Dependent Claims (17, 18)
-
Specification