Generation method and device for generating anonymous dataset, and method and device for risk evaluation

US 20140189858A1
Filed: 12/27/2012
Published: 07/03/2014
Est. Priority Date: 12/27/2012
Status: Active Grant

Abstract

An anonymous dataset generation method comprises the following steps. A critical attribute set and a quasi-identifier (QID) set are acquired, and one of the critical attributes or one of the quasi-identifiers is set as an anchor attribute. An attribute sequence and an equivalence table are generated according to the quasi-identifier set and the critical attribute set. A data cluster and a cluster table are generated according to the equivalence table. The content of the cluster table is generalized to generate and output an anonymous dataset corresponding to an original dataset. A risk evaluation method for an anonymous dataset calculates data weights to extract distinctive data and attacks defects of the anonymous dataset according to the distinctive data, thereby enhancing the efficiency of risk evaluation for the anonymous dataset.
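
The equivalence-table step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the record layout (a list of dicts), the function name `build_equivalence_table`, and the use of plain tuple sorting in place of the equivalence codes are all assumptions made for illustration. The first entry of `attribute_sequence` is assumed to play the role of the anchor attribute.

```python
from collections import Counter


def build_equivalence_table(records, attribute_sequence):
    """Group records by quasi-identifier values and sort the groups.

    records: list of dicts (hypothetical layout of the original dataset).
    attribute_sequence: quasi-identifier names, highest priority first;
        the first name is assumed to be the anchor attribute.
    Returns a list of (qid_values, quantity) pairs. Records sharing the
    same quasi-identifier values form one equivalence class, and the
    sort keeps classes with the same anchor value next to each other.
    """
    counts = Counter(
        tuple(rec[attr] for attr in attribute_sequence) for rec in records
    )
    # Assumption: sorting the raw value tuples stands in for sorting by
    # the equivalence codes mentioned in the claims.
    return sorted(counts.items())


records = [
    {"sex": "F", "age": 30, "zip": "100"},
    {"sex": "M", "age": 25, "zip": "100"},
    {"sex": "F", "age": 30, "zip": "100"},
    {"sex": "F", "age": 41, "zip": "200"},
]
table = build_equivalence_table(records, ["sex", "age"])
```

Here the two records with `("F", 30)` collapse into one equivalence class with quantity 2, and both `"F"` classes sort ahead of the `"M"` class.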

48 Claims

1. An anonymous dataset generation method, comprising:
- acquiring a critical attribute set and a quasi-identifier set, wherein the critical attribute set comprises at least one critical attribute, the quasi-identifier set comprises a plurality of quasi-identifiers, and one of the at least one critical attribute or one of the quasi-identifiers is set as an anchor attribute;
  
  generating an equivalence table according to the quasi-identifier set, the critical attribute set and an original dataset, wherein the equivalence table comprises a plurality of equivalence classes, each of the equivalence classes comprises at least one equivalence data, and each equivalence data comprises a plurality of original values corresponding to the quasi-identifiers respectively;
  
  generating a plurality of data clusters of a cluster table sequentially according to the equivalence table, wherein each of the data clusters comprises at least one of the equivalence classes; and
  
  generalizing content of the cluster table to generate and output an anonymous dataset corresponding to the original dataset, wherein the original values corresponding to the anchor attribute are maintained originally in the anonymous dataset.
- - 2. The anonymous dataset generation method according to claim 1, wherein the step of acquiring the critical attribute set and the quasi-identifier set comprises:
    - reading the quasi-identifier set and the critical attribute set; and
      
      deleting all the at least one critical attribute, which does not belong to the quasi-identifier set, from the critical attribute set when any one of the at least one critical attribute does not belong to the quasi-identifier set.
  - 3. The anonymous dataset generation method according to claim 1, wherein the step of generating the equivalence table according to the quasi-identifier set, the critical attribute set and the original dataset, comprises:
    - extracting the equivalence table from the original dataset according to the quasi-identifier set, wherein each of the equivalence classes comprises a corresponding quantity of the at least one equivalence data;
      
      generating an attribute sequence according to the quasi-identifier set and the critical attribute set;
      
      generating a plurality of value codes corresponding to the original values;
      
      encoding the equivalence classes according to the attribute sequence and the value codes, to generate a plurality of equivalence codes; and
      
      sorting the equivalence classes of the equivalence table according to the equivalence codes, and outputting the sorted equivalence table.
  - 4. The anonymous dataset generation method according to claim 3, wherein the step of generating the value codes corresponding to the original values comprises:
    - for at least one of the quasi-identifiers belonging to a numeric type attribute, setting the corresponding original values as the corresponding value codes; and
      
      for each one of the quasi-identifiers belonging to a categorical type attribute, generating a taxonomy tree according to the corresponding original values, and encoding the corresponding original values via the taxonomy tree to obtain the corresponding value codes.
  - 5. The anonymous dataset generation method according to claim 4, wherein each of the at least one critical attribute is one of the quasi-identifiers, and forms the attribute sequence through an attribute sequence rule, and the attribute sequence rule comprises a first rule, a second rule, a third rule and a fourth rule;
    - the first rule requires that a priority of the at least one critical attribute is higher than a priority of the at least one quasi-identifier which does not belong to the at least one critical attribute;
      
      the second rule requires that a priority of the quasi-identifier of the categorical type attribute is higher than a priority of the quasi-identifier of the numeric type attribute;
      
      the third rule requires that a priority of the quasi-identifier corresponding to a lower height of the taxonomy tree is higher than a priority of the quasi-identifier corresponding to a higher height of the taxonomy tree;
      
      the fourth rule requires that a priority of the quasi-identifier corresponding to a lower original value variance is higher than a priority of the quasi-identifier corresponding to a higher original value variance; and
      
      the critical attribute or the quasi-identifier, which has the highest priority, is set as the anchor attribute.
  - 6. The anonymous dataset generation method according to claim 1, wherein each of the equivalence classes comprises a corresponding quantity of the at least one equivalence data, and the step of generating the cluster table according to the equivalence table comprises:
    - adding the equivalence classes to the data clusters sequentially according to the corresponding quantities; and
      
      when a total quantity of the plurality of equivalence data in any one of the data clusters is smaller than an anonymous parameter, performing steps of;
      
      setting the data cluster, which the total quantity of the plurality of equivalence data is smaller than the anonymous parameter, as a first cluster;
      
      setting the data cluster before the first cluster, as a second cluster; and
      
      merging the first cluster and the second cluster when the original values corresponding to the anchor attribute, in the first cluster and the second cluster are the same.
  - 7. The anonymous dataset generation method according to claim 6, wherein the step of adding the equivalence classes to the data clusters according to the corresponding quantities sequentially comprises:
    - reading the equivalence classes sequentially; and
      
      when the read equivalence class is the first in the equivalence table, performing steps of;
      
      adding a temporary cluster, and setting the temporary cluster to correspond to one of the data clusters;
      
      recording the original value corresponding to the anchor attribute, in the read equivalence class to be a current anchor value;
      
      accumulating an accumulative quantity according to the corresponding quantity of the read equivalence class; and
      
      adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity.
  - 8. The anonymous dataset generation method according to claim 7, wherein the step of adding the equivalence classes to the data clusters sequentially according to the corresponding quantities further comprises:
    - when the read equivalence class is not the first in the equivalence table, and when the original value corresponding to the anchor attribute, in the read equivalence class is the same as the current anchor value, performing steps of;
      
      accumulating the accumulative quantity according to the corresponding quantity of the read equivalence class; and
      
      adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity.
  - 9. The anonymous dataset generation method according to claim 8, wherein the step of adding the equivalence classes to the data clusters sequentially according to the corresponding quantities further comprises:
    - when the read equivalence class is not the first in the equivalence table, and when the original value corresponding to the anchor attribute, in the read equivalence class is not the same as the current anchor value, performing steps of;
      
      storing the temporary cluster as the corresponding data cluster;
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the stored data cluster;
      
      adding the read equivalence class to the initialized temporary cluster according to the anonymous parameter and the corresponding quantity;
      
      accumulating the initialized accumulative quantity according to the corresponding quantity of the read equivalence classes; and
      
      recording the original value corresponding to the anchor attribute, in the read equivalence class as the current anchor value.
  - 10. The anonymous dataset generation method according to claim 8, wherein the step of adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity comprises:
    - adding all the at least one equivalence data of the read equivalence class to the temporary cluster when the accumulative quantity is smaller than the anonymous parameter.
  - 11. The anonymous dataset generation method according to claim 10, wherein the step of adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity further comprises:
    - when the accumulative quantity is equal to the anonymous parameter, or when the accumulative quantity is bigger than the anonymous parameter and is smaller than twice of the anonymous parameter, performing steps of;
      
      adding all the at least one equivalence data of the read equivalence class to the temporary cluster;
      
      storing the temporary cluster as the corresponding data cluster; and
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the corresponding data cluster.
  - 12. The anonymous dataset generation method according to claim 11, wherein the anonymous parameter is a positive integer bigger than 1, the corresponding quantity of the read equivalence classes is bigger than the anonymous parameter, and the step of adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity further comprises:
    - when the accumulative quantity is bigger than or equal to twice of the anonymous parameter, performing steps of;
      
      dividing all the plurality of equivalence data of the read equivalence class into a first group and a second group, wherein the first group comprises at least one of the plurality of equivalence data, and the second group comprises at least one of the remaining equivalence data in the read equivalence class;
      
      adding the at least one equivalence data of the first group to the temporary cluster;
      
      storing the temporary cluster as the corresponding data cluster;
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the corresponding data cluster;
      
      adding the at least one equivalence data of the second group to the initialized temporary cluster;
      
      storing the temporary cluster with the second group, as the corresponding data cluster; and
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the corresponding data cluster with the second group.
  - 13. The anonymous dataset generation method according to claim 1, wherein each of the data clusters comprises a cluster code and at least one of the equivalence classes, and the step of generalizing the content of the cluster table to generate and output the anonymous dataset corresponding to the original dataset, comprises:
    - reading the equivalence classes of the data clusters sequentially;
      
      when the read equivalence class is the first in the cluster table, setting the read equivalence class as a temporary generalized model, wherein the temporary generalized model comprises a plurality of first attribute values corresponding to the quasi-identifiers respectively, and the original values of the first attribute values are the original values of first one of the data clusters;
      
      when the read equivalence class is not the first in the cluster table, and when the read equivalence class is the same as the cluster code corresponding to the temporary generalized model, performing steps of;
      
      searching for a smallest generalized model between the read equivalence class and the temporary generalized model; and
      
      storing the smallest generalized model as an updated temporary generalized model; and
      
      when the read equivalence class is not the first in the cluster table, and when the read equivalence class is different from the cluster code corresponding to the temporary generalized model, performing steps of;
      
      storing the temporary generalized model in the anonymous dataset; and
      
      setting the read equivalence class as the temporary generalized model.
  - 14. The anonymous dataset generation method according to claim 13, wherein the smallest generalized model comprises a plurality of second attribute values corresponding to the quasi-identifiers respectively, and the step of searching for the smallest generalized model between the read equivalence class and the temporary generalized model comprises:
    - setting the quasi-identifiers as a current identifier sequentially;
      
      setting the first attribute value corresponding to the current identifier, to be the second attribute value of the smallest generalized model when the first attribute value and the original value, which correspond to the current identifier, are the same;
      
      generating a generalized value range according to the first attribute value, and the original value, which correspond to the current identifier, and setting the generalized value range as the second attribute value of the smallest generalized model, when the first attribute value and the original value, which correspond to the current identifier, are different and the current identifier belongs to a numeric type attribute; and
      
      generating a generalized string according to a taxonomy tree, the first attribute value and the original value, which correspond to the current identifier, and setting the generalized string as the second attribute value of the smallest generalized model, when the first attribute value and the original value, which correspond to the current identifier, are different and the current identifier belongs to a categorical type attribute.
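
The search for a smallest generalized model in claims 13 and 14 can be sketched as follows. This is a hedged illustration under stated assumptions, not the claimed implementation: each taxonomy tree is assumed to be a single-rooted child-to-parent dict, a numeric range is represented as a `(low, high)` tuple, the "generalized string" is taken to be the lowest common ancestor in the tree, and all function names are hypothetical. The sketch merges the rows of a single data cluster rather than tracking cluster codes.

```python
def lowest_common_ancestor(parent, a, b):
    """Return the nearest node that is an ancestor of (or equal to) both a and b.

    parent: child -> parent mapping of a single-rooted taxonomy tree.
    """
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    node = b
    while node not in ancestors:
        node = parent[node]  # raises KeyError if the values share no root
    return node


def merge_value(current, value, parent=None):
    """Merge one original value into the current generalized value."""
    if current == value:
        return current  # identical values are kept as they are
    if parent is None:  # numeric attribute: widen the value range
        low, high = current if isinstance(current, tuple) else (current, current)
        return (min(low, value), max(high, value))
    # categorical attribute: climb the taxonomy tree
    return lowest_common_ancestor(parent, current, value)


def generalize_cluster(rows, taxonomies):
    """Fold all rows of one cluster into a single generalized model.

    rows: tuples of quasi-identifier values, in a fixed attribute order.
    taxonomies: per attribute, None for numeric or a child -> parent dict.
    """
    model = list(rows[0])
    for row in rows[1:]:
        model = [merge_value(m, v, t) for m, v, t in zip(model, row, taxonomies)]
    return tuple(model)


sex_tree = {"F": "Person", "M": "Person"}
job_tree = {"Nurse": "Medical", "Doctor": "Medical", "Medical": "Any", "Lawyer": "Any"}
rows = [("F", 30, "Nurse"), ("F", 34, "Doctor"), ("F", 28, "Nurse")]
model = generalize_cluster(rows, [sex_tree, None, job_tree])
```

Because every row in the example shares the anchor value `"F"`, that value passes through unchanged, which matches the requirement in claim 1 that the anchor attribute keeps its original values. The ages widen to the range `(28, 34)` and the jobs generalize to `"Medical"`.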

15. An anonymous dataset generation device, comprising:
- a memory, for storing data or storing data temporarily; and
  
  a processor, coupled to the memory, and comprising;
  
  an equivalence generation module, for performing steps of;
  
  acquiring a critical attribute set and a quasi-identifier set, wherein the critical attribute set comprises at least one critical attribute, the quasi-identifier set comprises a plurality of quasi-identifiers, and one of the at least one critical attribute or one of the quasi-identifiers is set as an anchor attribute; and
  
  generating an equivalence table according to the quasi-identifier set, the critical attribute set and an original dataset, wherein the equivalence table comprises a plurality of equivalence classes, each of the equivalence classes comprises at least one equivalence data, and each equivalence data comprises a plurality of original values corresponding to the quasi-identifiers respectively;
  
  a cluster generation module, for generating a plurality of data clusters of a cluster table according to the equivalence table sequentially, wherein each of the data clusters comprises at least one of the equivalence classes; and
  
  a data generalization module, for generalizing content of the cluster table to generate and output an anonymous dataset corresponding to the original dataset, wherein the original values corresponding to the anchor attribute are maintained originally in the anonymous dataset.
- - 16. The anonymous dataset generation device according to claim 15, wherein in the step of acquiring the critical attribute set and the quasi-identifier set, the equivalence generation module performs steps of:
    - reading the quasi-identifier set and the critical attribute set; and
      
      deleting all the at least one critical attribute, not belonging to the quasi-identifier set, from the critical attribute set when any one of the at least one critical attribute does not belong to the quasi-identifier set.
  - 17. The anonymous dataset generation device according to claim 15, wherein in the step of generating the equivalence table according to the quasi-identifier set, the critical attribute set and the original dataset, the equivalence generation module performs steps of:
    - extracting the equivalence table from the original dataset according to the quasi-identifier set, wherein each of the equivalence classes comprises a corresponding quantity of the at least one equivalence data;
      
      generating an attribute sequence according to the quasi-identifier set and the critical attribute set;
      
      generating a plurality of value codes corresponding to the original values;
      
      encoding the equivalence classes according to the attribute sequence and the value codes to generate a plurality of equivalence codes; and
      
      sorting the equivalence classes of the equivalence table according to the equivalence codes, and outputting the sorted equivalence table.
  - 18. The anonymous dataset generation device according to claim 17, wherein in the step of generating the value codes corresponding to the original values, the equivalence generation module performs steps of:
    - for the at least one of the quasi-identifiers belonging to a numeric type attribute, setting the corresponding original values as the corresponding value codes; and
      
      for each one of the quasi-identifiers belonging to a categorical type attribute, generating a taxonomy tree according to the corresponding original values, and encoding the corresponding original values via the taxonomy tree to obtain the corresponding value codes.
  - 19. The anonymous dataset generation device according to claim 18, wherein each of the at least one critical attribute is one of the quasi-identifiers, the equivalence generation module employs an attribute sequence rule to generate the attribute sequence, and the attribute sequence rule comprises a first rule, a second rule, a third rule and a fourth rule;
    - the first rule requires that a priority of the at least one critical attribute is higher than a priority of the at least one quasi-identifier which does not belong to the at least one critical attribute;
      
      the second rule requires that a priority of the quasi-identifier of the categorical type attribute is higher than a priority of the quasi-identifier of the numeric type attribute;
      
      the third rule requires that a priority of the quasi-identifier corresponding to a shorter height of the taxonomy tree is higher than a priority of the quasi-identifier corresponding to a taller height of the taxonomy tree;
      
      the fourth rule requires that a priority of the quasi-identifier corresponding to a lower original value variance is higher than a priority of the quasi-identifier corresponding to a higher original value variance; and
      
      the first critical attribute or the first quasi-identifier, which has the highest priority, is set as the anchor attribute.
  - 20. The anonymous dataset generation device according to claim 15, wherein each of the equivalence classes comprises a corresponding quantity of the at least one equivalence data, and the cluster generation module performs steps of:
    - adding the equivalence classes to the data clusters according to the corresponding quantities sequentially; and
      
      when a total quantity of the equivalence data in any one of the data clusters is smaller than an anonymous parameter, the cluster generation module performing steps of;
      
      setting the data cluster, which the total quantity of the equivalence data is smaller than the anonymous parameter, as a first cluster;
      
      setting the data cluster before the first cluster, as a second cluster; and
      
      merging the first cluster and the second cluster when the original values corresponding to the anchor attribute, in the first cluster and the second cluster are the same.
  - 21. The anonymous dataset generation device according to claim 20, wherein the cluster generation module performs steps of:
    - reading the equivalence classes sequentially; and
      
      when the read equivalence class is the first in the equivalence table, the cluster generation module performing steps of;
      
      adding a temporary cluster, and setting the temporary cluster to correspond to one of the data clusters;
      
      recording the original value corresponding to the anchor attribute, in the read equivalence class to be a current anchor value;
      
      accumulating an accumulative quantity according to the corresponding quantity of the read equivalence class; and
      
      adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity.
  - 22. The anonymous dataset generation device according to claim 21, wherein the cluster generation module further performs steps of:
    - when the read equivalence class is not the first in the equivalence table, and when the original value corresponding to the anchor attribute, in the read equivalence class is the same as the current anchor value, the cluster generation module performing steps of;
      
      accumulating the accumulative quantity according to the corresponding quantity of the read equivalence class; and
      
      adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity.
  - 23. The anonymous dataset generation device according to claim 22, wherein the cluster generation module further performs steps of:
    - when the read equivalence class is not the first equivalence class in the equivalence table, and when the original value corresponding to the anchor attribute, in the read equivalence class is different from the current anchor value, the cluster generation module performing steps of;
      
      storing the temporary cluster as the corresponding data cluster;
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the corresponding data cluster;
      
      adding the read equivalence class to the initialized temporary cluster according to the anonymous parameter and the corresponding quantity;
      
      accumulating the initialized accumulative quantity according to the corresponding quantity of the read equivalence class; and
      
      recording the original value corresponding to the anchor attribute, in the read equivalence class to be the current anchor value.
  - 24. The anonymous dataset generation device according to claim 22, wherein in the step of adding the read equivalence class to the temporary cluster according to the anonymous parameter and the accumulative quantity, the cluster generation module performs a step of:
    - adding all the at least one equivalence data of the read equivalence class to the temporary cluster when the accumulative quantity is smaller than the anonymous parameter.
  - 25. The anonymous dataset generation device according to claim 24, wherein the cluster generation module further performs steps of:
    - when the accumulative quantity is equal to the anonymous parameter, or when the accumulative quantity is bigger than the anonymous parameter and is smaller than twice of the anonymous parameter, the cluster generation module performing steps of;
      
      adding all the at least one equivalence data of the read equivalence class to the temporary cluster;
      
      storing the temporary cluster as the corresponding data cluster; and
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the corresponding data cluster.
  - 26. The anonymous dataset generation device according to claim 25, wherein the anonymous parameter is a positive integer bigger than 1, the corresponding quantity of the read equivalence class is bigger than the anonymous parameter, and the cluster generation module further performs steps of:
    - when the accumulative quantity is bigger than or equal to twice of the anonymous parameter, the cluster generation module performing steps of;
      
      dividing all the equivalence data of the read equivalence class into a first group and a second group, the first group comprises at least one of the equivalence data, the second group comprises at least one of the remaining equivalence data in the read equivalence class;
      
      adding the at least one equivalence data of the first group to the temporary cluster;
      
      storing the temporary cluster as the corresponding data cluster;
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the stored data cluster;
      
      adding the at least one equivalence data of the second group to the initialized temporary cluster;
      
      storing the temporary cluster with the second group, as the corresponding data cluster; and
      
      initializing the temporary cluster and the accumulative quantity, and setting the initialized temporary cluster to correspond to a next data cluster after the data cluster stored with the second group.
  - 27. The anonymous dataset generation device according to claim 15, wherein each of the data clusters comprises a cluster code and at least one of the equivalence classes, and the data generalization module performs steps of:
    - reading the equivalence classes of the data clusters sequentially;
      
      when the read equivalence class is the first in the cluster table, setting the first equivalence class as a temporary generalized model, wherein the temporary generalized model comprises a plurality of first attribute values corresponding to the quasi-identifiers respectively, and original values of the first attribute values are the original values of first one of the data clusters;
      
      when the read equivalence class is not the first in the cluster table, and when the read equivalence class is the same as the cluster code corresponding to the temporary generalized model, the data generalization module performing steps of;
      
      searching for a smallest generalized model between the read equivalence class and the temporary generalized model; and
      
      storing the smallest generalized model as an updated temporary generalized model; and
      
      when the read equivalence class is not the first in the cluster table, and when the read equivalence class is different from the cluster code corresponding to the temporary generalized model, the data generalization module performing steps of;
      
      storing the current temporary generalized model in the anonymous dataset; and
      
      setting the read equivalence class as the temporary generalized model.
  - 28. The anonymous dataset generation device according to claim 27, wherein the smallest generalized model comprises a plurality of second attribute values corresponding to the quasi-identifiers respectively, and the data generalization module further performs steps of:
    - setting the quasi-identifiers as a current identifier sequentially;
      
      setting the first attribute value corresponding to the current identifier, as the second attribute value of the smallest generalized model, when the first attribute value and the original value, which correspond to the current identifier, are the same;
      
      generating a generalized value range according to the first attribute value and the original value, which correspond to the current identifier, and setting the generalized value range as the second attribute value of the smallest generalized model, when the first attribute value and the original value, which correspond to the current identifier, are different and the current identifier belongs to a numeric type attribute; and
      
      generating a generalized string according to a taxonomy tree, the first attribute value and the original value, which correspond to the current identifier, and setting the generalized string as the second attribute value of the smallest generalized model, when the first attribute value and the original value, which correspond to the current identifier, are different and the current identifier belongs to a categorical type attribute.
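
The cluster generation steps of claims 20 to 26 (which mirror method claims 6 to 12) can be sketched in Python as below. This is an illustrative reading of the claims, not the patented implementation. The input is assumed to be a sorted list of `(qid_values, quantity)` pairs with the anchor value at index 0 of `qid_values`, and `k` stands for the anonymous parameter. The claims do not say how an oversized equivalence class is divided into two groups; the sketch assumes the first group tops the temporary cluster up to exactly `k` records. The name `build_clusters` is hypothetical.

```python
def build_clusters(equivalence_table, k):
    """Group sorted equivalence classes into clusters of at least k records
    where possible, never mixing different anchor values in one cluster."""
    clusters = []       # stored data clusters, each a list of (values, quantity)
    temp, acc = [], 0   # temporary cluster and accumulative quantity
    anchor = None       # current anchor value

    def store():
        """Store the temporary cluster (if non-empty) and re-initialize it."""
        nonlocal temp, acc
        if temp:
            clusters.append(temp)
        temp, acc = [], 0

    for values, quantity in equivalence_table:
        if values[0] != anchor:      # anchor value changed: close the cluster
            store()
            anchor = values[0]
        prior = acc
        acc += quantity
        if acc < k:                  # still too small: keep accumulating
            temp.append((values, quantity))
        elif acc < 2 * k:            # k <= acc < 2k: add the class and store
            temp.append((values, quantity))
            store()
        else:                        # acc >= 2k: split the class in two groups
            first = k - prior        # assumption: top up to exactly k records
            temp.append((values, first))
            store()
            temp.append((values, quantity - first))
            store()
    store()

    # Merge an undersized cluster into the preceding one when both share
    # the same anchor value; otherwise it is left as it is.
    merged = []
    for cluster in clusters:
        size = sum(q for _, q in cluster)
        same_anchor = bool(merged) and merged[-1][0][0][0] == cluster[0][0][0]
        if size < k and same_anchor:
            merged[-1] = merged[-1] + cluster
        else:
            merged.append(cluster)
    return merged


table = [(("F", 30), 2), (("F", 41), 2), (("F", 50), 1),
         (("M", 25), 7), (("M", 33), 1)]
clusters = build_clusters(table, k=3)
sizes = [sum(q for _, q in c) for c in clusters]
```

With `k=3`, the leftover `("F", 50)` class and the leftover `("M", 33)` class are each too small on their own, so the final pass merges them into the preceding cluster with the same anchor value. An undersized cluster with no same-anchor predecessor would stay undersized in this sketch, since the claims quoted here do not state what happens in that case.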

29. A risk evaluation method, for evaluating an anonymous dataset generated according to an original dataset, and comprising:
- acquiring a plurality of appearing times respectively corresponding to a plurality of original values of the original dataset;
  
  generating a partition set and a weight table according to a sample parameter, an anonymous parameter and the appearing times;
  
  dividing the original dataset into a plurality of data partitions according to the partition set, and generating a penetration dataset according to the weight table and the data partitions, wherein the penetration dataset comprises a plurality of sample data;
  
  comparing each sample data with a plurality of anonymous data of the anonymous dataset to obtain a plurality of matching quantities respectively corresponding to the sample data; and
  
  calculating and outputting a risk evaluation result according to the matching quantities.
- View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38)
- - 30. The risk evaluation method according to claim 29, wherein the anonymous dataset has a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, the original dataset comprises the original values corresponding to the quasi-identifiers, and the appearing times are times the corresponding original values appear in the original dataset.
  - 31. The risk evaluation method of the anonymous dataset according to claim 30, wherein the step of generating the partition set and the weight table according to the sample parameter, the anonymous parameter and the appearing times comprises:
    - arranging the quasi-identifiers to generate a plurality of candidate combinations, wherein each of the candidate combinations comprises at least one of the quasi-identifiers;
      
      calculating a plurality of original value combinational numbers respectively corresponding to the candidate combinations;
      
      selecting the smallest original value combinational number from at least one of the original value combinational numbers, which is bigger than or equal to the sample parameter, and setting the candidate combination corresponding to the selected original value combinational number, to be the partition set; and
      
      generating the weight table according to the sample parameter, the anonymous parameter and the appearing times.
  - 32. The risk evaluation method according to claim 31, wherein the weight table comprises a plurality of weight values respectively corresponding to the original values, and the step of generating the weight table according to the sample parameter, the anonymous parameter and the appearing times comprises:
    - calculating a weight parameter, wherein a product of the weight parameter and the anonymous parameter is bigger than or equal to the largest one of the appearing times;
      
      reading the original values sequentially;
      
      when the appearing times corresponding to the current original value is bigger than the anonymous parameter, the weight value corresponding to the current original value being equal to the product of the weight parameter and the anonymous parameter, minus the appearing times corresponding to the current original value, and plus the sample parameter; and
      
      when the appearing times corresponding to the current original value is smaller than or equal to the anonymous parameter, the weight value corresponding to the current original value being equal to the product of the weight parameter and the anonymous parameter, plus the appearing times corresponding to the current original value, and plus the sample parameter.
  - 33. The risk evaluation method according to claim 29, wherein a quantity of the data partitions is bigger than or equal to the sample parameter.
  - 34. The risk evaluation method according to claim 29, wherein the step of dividing the original dataset into the data partitions according to the partition set and generating the penetration dataset according to the weight table and the data partitions, comprises:
    - dividing the original dataset into the data partitions according to the partition set, wherein each of the data partitions comprises at least one original data;
      
      reading the data partitions sequentially, and calculating an original weight of each of the at least one original data in the current data partition via the weight table;
      
      selecting one of the at least one original data from the current data partition according to the original weight, and setting the selected original data as one of the sample data; and
      
      updating the weight table according to the selected original data.
  - 35. The risk evaluation method according to claim 34, wherein the anonymous dataset comprises a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, each original data comprises the original values respectively corresponding to the quasi-identifiers, the original values respectively correspond to a plurality of weight values of the weight table, and the step of updating the weight table according to the selected original data comprises:
    - subtracting 1 from each of the weight values corresponding to the selected original data.
  - 36. The risk evaluation method according to claim 29, wherein the anonymous dataset comprises a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, each sample data comprises the original values respectively corresponding to the quasi-identifiers, each anonymous data comprises a plurality of third attribute values respectively corresponding to the quasi-identifiers, and the step of comparing each sample data with the plurality of anonymous data of the anonymous dataset to obtain the matching quantities respectively corresponding to the sample data, comprises:
    - reading the plurality of sample data sequentially, and for each sample data, performing steps of:
      
      comparing the original values of the current sample data with the third attribute values of the current anonymous data according to the quasi-identifiers for the plurality of anonymous data sequentially;
      
      setting the current anonymous data as a matching data when each original value and each third attribute value, which correspond to each other, are in a same attribute level; and
      
      setting a quantity of the matching data corresponding to the current sample data, to be the corresponding matching quantity.
  - 37. The risk evaluation method according to claim 36, wherein each of the third attribute values belonging to a numeric type attribute is a generalized value range, when the original value of the current sample data is in the corresponding generalized value range, the original value of the current sample data and the corresponding third attribute value are at the same attribute level;
    - and each of the third attribute values belonging to a categorical type attribute is a generalized string, when the original value of the current sample data belongs to the corresponding generalized string, the original value of the current sample data and the corresponding third attribute value are at the same attribute level.
  - 38. The risk evaluation method according to claim 29, wherein the risk evaluation result comprises a maximum risk probability, a minimum risk probability or an average risk probability.
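
The partition-set selection of claim 31 and the weight-table rule of claim 32 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the "original value combinational number" is assumed to be the number of distinct value tuples observed for a candidate combination, the weight parameter is assumed to be the smallest integer that satisfies the stated inequality, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def appearing_times(records, qids):
    """Count how often each (quasi-identifier, value) pair appears."""
    return Counter((q, r[q]) for r in records for q in qids)


def choose_partition_set(records, qids, sample_param):
    """Pick the candidate combination whose number of distinct value
    tuples is the smallest one that is still >= sample_param."""
    best = None
    for size in range(1, len(qids) + 1):
        for combo in combinations(qids, size):
            n = len({tuple(r[q] for q in combo) for r in records})
            if n >= sample_param and (best is None or n < best[0]):
                best = (n, combo)
    return best[1] if best else None


def build_weight_table(counts, sample_param, anon_param):
    """Apply the two-case weight rule of claim 32 to every original value."""
    # Assumption: smallest integer w with w * anon_param >= max count.
    w = ceil(max(counts.values()) / anon_param)
    base = w * anon_param
    table = {}
    for value, count in counts.items():
        if count > anon_param:
            table[value] = base - count + sample_param
        else:
            table[value] = base + count + sample_param
    return table


records = [
    {"age": 30, "zip": "100", "job": "Engineer"},
    {"age": 30, "zip": "100", "job": "Teacher"},
    {"age": 30, "zip": "200", "job": "Engineer"},
    {"age": 41, "zip": "200", "job": "Engineer"},
    {"age": 41, "zip": "300", "job": "Engineer"},
    {"age": 52, "zip": "300", "job": "Nurse"},
]
qids = ["age", "zip", "job"]
counts = appearing_times(records, qids)
print(choose_partition_set(records, qids, sample_param=4))
print(build_weight_table(counts, sample_param=4, anon_param=2))
```

Under this rule a value that appears more often than the anonymous parameter gets a lower weight the more common it is, while a value at or below the anonymous parameter always ends up above the base product, so rare values are favored when samples are drawn.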

39. A risk evaluation device for evaluating an anonymous dataset generated according to an original dataset, comprising:
- a memory, for storing data or storing data temporarily; and
  
  a processor, coupled to the memory, and comprising:
  
  a weight generation module, for acquiring a plurality of appearing times respectively corresponding to a plurality of original values of the original dataset, and for generating a partition set and a weight table according to a sample parameter, an anonymous parameter and the appearing times;
  
  a sample generation module, for dividing the original dataset into a plurality of data partitions according to the partition set, and for generating a penetration dataset according to the weight table and the data partitions, wherein the penetration dataset comprises a plurality of sample data; and
  
  a risk evaluation module, for comparing each sample data with a plurality of anonymous data of the anonymous dataset in order to obtain a plurality of matching quantities respectively corresponding to the plurality of sample data, and for calculating and outputting a risk evaluation result according to the matching quantities.
- View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47, 48)
- - 40. The risk evaluation device according to claim 39, wherein the anonymous dataset comprises a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, the original dataset comprises the original values corresponding to the quasi-identifiers, and the appearing times are times the corresponding original values appear in the original dataset.
  - 41. The risk evaluation device according to claim 40, wherein the weight generation module performs steps of:
    - arranging the quasi-identifiers to generate a plurality of candidate combinations, wherein each of the candidate combinations comprises at least one of the quasi-identifiers;
      
      calculating a plurality of original value combinational numbers respectively corresponding to the candidate combinations;
      
      selecting the smallest one from at least one of the original value combinational numbers bigger than or equal to the sample parameter, and setting the candidate combination corresponding to the smallest original value combinational number, to be the partition set; and
      
      generating the weight table according to the sample parameter, the anonymous parameter and the appearing times.
  - 42. The risk evaluation device according to claim 41, wherein the weight table comprises a plurality of weight values respectively corresponding to the original values, and in the step of generating the weight table according to the sample parameter, the anonymous parameter and the appearing times, the weight generation module performs steps of:
    - calculating a weight parameter, wherein a product of the weight parameter and the anonymous parameter is bigger than or equal to the largest one of the appearing times;
      
      reading the original values sequentially;
      
      when the appearing times corresponding to the current original value is bigger than the anonymous parameter, the weight value corresponding to the current original value is equal to the product of the weight parameter and the anonymous parameter, minus the appearing times corresponding to the current original value, and plus the sample parameter; and
      
      when the appearing times corresponding to the current original value is smaller than or equal to the anonymous parameter, the weight value corresponding to the current original value is equal to the product of the weight parameter and the anonymous parameter, plus the appearing times corresponding to the current original value, and plus the sample parameter.
  - 43. The risk evaluation device according to claim 39, wherein a quantity of the data partitions is bigger than or equal to the sample parameter.
  - 44. The risk evaluation device according to claim 39, wherein the sample generation module performs steps of:
    - dividing the original dataset into the data partitions according to the partition set, wherein each of the data partitions comprises at least one original data;
      
      reading the data partitions sequentially, and calculating an original weight of each of the at least one original data in the current data partition via the weight table;
      
      selecting one of the at least one original data from the current data partition according to the original weight, and setting the selected original data as one of the plurality of sample data; and
      
      updating the weight table according to the selected original data.
  - 45. The risk evaluation device according to claim 44, wherein the anonymous dataset has a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, each original data comprises the original values corresponding to the quasi-identifiers respectively, the original values respectively correspond to the weight values of the weight table, and the sample generation module performs a step of:
    - subtracting 1 from each of the weight values corresponding to the selected original data.
  - 46. The risk evaluation device according to claim 39, wherein the anonymous dataset has a quasi-identifier set, the quasi-identifier set comprises a plurality of quasi-identifiers, each sample data comprises the original values respectively corresponding to the quasi-identifiers, each anonymous data comprises a plurality of third attribute values respectively corresponding to the quasi-identifiers, and the risk evaluation module performs steps of:
    - reading the plurality of sample data sequentially, and for each sample data, performing steps of:
      
      comparing the original values of the current sample data with the third attribute values of the current anonymous data according to the quasi-identifiers for the plurality of anonymous data sequentially;
      
      setting the current anonymous data as a matching data when each original value and each third attribute value, which correspond to each other, are at a same attribute level; and
      
      setting a quantity of the matching data corresponding to the current sample data, to be the corresponding matching quantity.
  - 47. The risk evaluation device according to claim 46, wherein each of the third attribute values belonging to a numeric type attribute is a generalized value range, when the original value of the current sample data is in the corresponding generalized value range, the original value of the current sample data and the corresponding third attribute value are at the same attribute level;
    - and each of the third attribute values belonging to a categorical type attribute is a generalized string, when the original value of the current sample data belongs to the corresponding generalized string, the original value of the current sample data and the corresponding third attribute value are at the same attribute level.
  - 48. The risk evaluation device according to claim 39, wherein the risk evaluation result comprises a maximum risk probability, a minimum risk probability or an average risk probability.
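
The work of the sample generation module and the risk evaluation module (claims 44 to 48) can be sketched as below. Several details are not fixed by the claims and are assumptions here: the original weight of a record is taken as the sum of the weight values of its quasi-identifier values, the record with the highest weight is selected from each partition, a generalized value range is a `(low, high)` tuple, the taxonomy is a child-to-parent dictionary, and the risk probability of a sample is taken as 1 divided by its matching quantity. All names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def draw_samples(records, qids, partition_set, weights):
    """Build the penetration dataset: one sample per data partition."""
    weights = dict(weights)  # work on a copy of the weight table
    partitions = defaultdict(list)
    for r in records:
        partitions[tuple(r[q] for q in partition_set)].append(r)
    samples = []
    for part in partitions.values():
        # Assumption: pick the record with the highest summed weight.
        chosen = max(part, key=lambda r: sum(weights[(q, r[q])] for q in qids))
        samples.append(chosen)
        for q in qids:
            weights[(q, chosen[q])] -= 1  # update step of claim 45
    return samples


def covers(generalized, value, parent):
    """True when value lies at the same attribute level as generalized."""
    if isinstance(generalized, tuple):  # numeric: generalized value range
        low, high = generalized
        return low <= value <= high
    node = value  # categorical: walk up the taxonomy tree
    while node != generalized and node in parent:
        node = parent[node]
    return node == generalized


def matching_quantities(samples, anonymous, qids, parent):
    """Count, for each sample, the anonymous data matching on every QID."""
    return [sum(all(covers(a[q], s[q], parent) for q in qids)
                for a in anonymous)
            for s in samples]


def risk_result(matches):
    """Assumption: risk probability of a sample is 1 / matching quantity."""
    probs = [1 / m if m else 0.0 for m in matches]
    return {"max": max(probs), "min": min(probs), "avg": mean(probs)}


parent = {"Engineer": "Technical", "Programmer": "Technical",
          "Technical": "Any", "Teacher": "Any"}
records = [{"age": 30, "job": "Engineer"},
           {"age": 30, "job": "Programmer"},
           {"age": 45, "job": "Teacher"}]
weights = {("age", 30): 5, ("age", 45): 6, ("job", "Engineer"): 4,
           ("job", "Programmer"): 7, ("job", "Teacher"): 7}
anonymous = [{"age": (30, 35), "job": "Technical"},
             {"age": (30, 35), "job": "Technical"},
             {"age": (40, 50), "job": "Any"}]
samples = draw_samples(records, ["age", "job"], ("age",), weights)
matches = matching_quantities(samples, anonymous, ["age", "job"], parent)
print(matches, risk_result(matches))
```

In this toy run the second sample matches exactly one anonymous record, which is the situation the maximum risk probability is meant to expose.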

Current Assignee
Industrial Technology Research Institute
Original Assignee
Industrial Technology Research Institute
Inventors
Chen, Ya-Ling; Yin, Ding-Jun; Hung, Kuo-Yang

Granted Patent

US 9,129,117 B2
Field of Search
US Class Current

726/22
CPC Class Codes

G06F 16/21   Design, administration or m...

G06F 16/245   Query processing

G06F 21/577   Assessing vulnerabilities a...

G06F 21/6254   by anonymising data, e.g. d...

G06F 2221/034   Test or assess a computer o...
