Generating anonymous data from web data
Abstract
A device receives web data, associated with user devices, that is generated based on interactions of the user devices with a network and one or more content provider devices. The device removes erroneous or objectionable web data from the web data to generate a subset of the web data, and categorizes the subset of the web data by assigning categories to the subset of the web data. The device performs an empirical estimation of the categorized subset of the web data to generate empirical estimations. The device performs a simulation of the empirical estimations to generate synthetic data that corresponds to the web data and removes private information relating to the user devices and users of the user devices, and stores the synthetic data in a storage device.
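The stages named in the abstract (cleaning, categorizing, empirical estimation, simulation) can be sketched end to end as follows. This is only an illustrative sketch under stated assumptions: the record fields (`user`, `url`, `valid`), the rule that derives a category from the first URL path segment, and the use of simple relative frequencies as the "empirical estimation" are all hypothetical choices, not details taken from the patent.

```python
import random
from collections import Counter

# Hypothetical web-data records; the field names are assumptions for illustration.
RAW = [
    {"user": "u1", "url": "/sports/scores", "valid": True},
    {"user": "u2", "url": "/news/world", "valid": True},
    {"user": "u3", "url": "/news/local", "valid": True},
    {"user": "u4", "url": "", "valid": True},            # erroneous: empty URL
    {"user": "u5", "url": "/shop/cart", "valid": False},  # flagged as objectionable
]

def clean(records):
    """Drop records with a missing URL or an explicit invalid flag."""
    return [r for r in records if r.get("url") and r.get("valid", True)]

def categorize(records):
    """Assign a category from the first URL path segment (assumed rule)."""
    return [r["url"].strip("/").split("/")[0] for r in records]

def estimate(categories):
    """Empirical estimation reduced to relative category frequencies."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def simulate(distribution, n, seed=0):
    """Draw n synthetic category events; no user identifiers are carried over."""
    rng = random.Random(seed)
    cats = sorted(distribution)
    weights = [distribution[c] for c in cats]
    return rng.choices(cats, weights=weights, k=n)

distribution = estimate(categorize(clean(RAW)))
synthetic = simulate(distribution, 100)
```

In this sketch the synthetic output contains only category labels, so the `user` field of the input never reaches it.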
20 Claims
1. A method to provide synthetic data, when statistical properties and privacy information, associated with web data, are preserved in the synthetic data, comprising:
  receiving, by a device and from user devices, the web data,
    the web data being associated with the user devices,
    the web data being generated based on interactions of the user devices with one or more content provider devices via a network, and
    the web data including one or more of:
      clickstream data that includes information associated with portions of content, provided by the one or more content provider devices, that are selected via the user devices,
      location data that includes information associated with locations of the user devices when the content is accessed by the user devices,
      time data that includes information associated with times when the user devices access the content, or
      network data that includes information associated with network resources utilized by the user devices to access the content;
  removing, by the device, erroneous or objectionable web data from the web data to generate a subset of the web data;
  categorizing, by the device, the subset of the web data by assigning categories to the subset of the web data;
  performing, by the device, an empirical estimation of the categorized subset of the web data to generate empirical estimations that include information that provides a representation of behaviors associated with users of the user devices;
  receiving, by the device, a selection of an anonymity level associated with generating the synthetic data;
  performing, by the device, a simulation of the empirical estimations to generate the synthetic data,
    the synthetic data including information associated with the empirical estimations, and
    the synthetic data removing private information, relating to the user devices and the users of the user devices, in accordance to the anonymity level;
  determining, by the device, whether the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data; and
  selectively:
    storing, by the device, the synthetic data in a storage device and providing the synthetic data when the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data, or
    re-performing, by the device, the simulation of the empirical estimations to generate other synthetic data when the statistical properties or the privacy information, associated with the web data, is not preserved in the synthetic data.
Dependent claims: 2, 3, 4, 5, 16, 19
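The final "determining" and "selectively" steps of claim 1 amount to a check-then-store-or-retry loop. A minimal sketch is given below, assuming that "statistical properties are preserved" means the total variation distance between the estimated and synthetic category frequencies stays under a tolerance, and that "privacy is preserved" means no real identifier appears in the synthetic records. Both criteria, the function names, the `storage` list standing in for a storage device, and the retry cap are hypothetical; the claim does not specify any of them.

```python
import random
from collections import Counter

def frequencies(labels):
    """Relative frequency of each label in a non-empty list."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def generate_until_preserved(estimate, real_ids, n, storage,
                             tolerance=0.1, max_attempts=20):
    """Simulate, check both criteria, then store or re-simulate."""
    categories = sorted(estimate)
    weights = [estimate[c] for c in categories]
    for attempt in range(1, max_attempts + 1):
        rng = random.Random(attempt)  # new seed so each retry differs
        labels = rng.choices(categories, weights=weights, k=n)
        synthetic = [(f"synthetic-{i}", label) for i, label in enumerate(labels)]
        # Assumed statistical criterion: frequencies stay close to the estimate.
        stats_ok = total_variation(estimate, frequencies(labels)) <= tolerance
        # Assumed privacy criterion: no real identifier leaks into the output.
        privacy_ok = all(rec_id not in real_ids for rec_id, _ in synthetic)
        if stats_ok and privacy_ok:
            storage.append(synthetic)  # list stands in for the storage device
            return synthetic, attempt
    raise RuntimeError("no acceptable synthetic data set within max_attempts")

estimate = {"news": 0.5, "sports": 0.3, "shop": 0.2}
storage = []
synthetic, attempts = generate_until_preserved(estimate, {"u1", "u2"}, 500, storage)
```

The retry cap is a design choice of the sketch: without it, an unreachable tolerance would loop forever.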
6. A device for providing synthetic data, when statistical properties and privacy information, associated with web data, are preserved in the synthetic data, comprising:
  one or more processors to:
    receive, from user devices, the web data,
      the web data being generated based on interactions of the user devices with a plurality of content provider devices via a network, and
      the web data including one or more of:
        clickstream data that includes information associated with portions of content, provided by the plurality of content provider devices, that are selected via the user devices,
        location data that includes information associated with locations of the user devices when the content is accessed by the user devices,
        time data that includes information associated with times when the user devices access the content, or
        network data that includes information associated with network resources utilized by the user devices to access the content;
    remove erroneous or objectionable web data from the web data to generate a subset of the web data;
    categorize the subset of the web data by assigning categories to the subset of the web data;
    perform an empirical estimation of the categorized subset of the web data to generate empirical estimations that include information that provides a representation of behaviors associated with users of the user devices;
    receive preference information for an anonymity level associated with generating synthetic data;
    perform a simulation of the empirical estimations to generate the synthetic data,
      the synthetic data including properties of the empirical estimations, and
      the synthetic data removing private information, relating to the user devices and the users of the user devices, in accordance with the preference information;
    determine whether the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data; and
    selectively:
      store the synthetic data in a storage device, and provide the synthetic data when the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data, or
      re-perform the simulation of the empirical estimations to generate other synthetic data when the statistical properties or the privacy information, associated with the web data, is not preserved in the synthetic data.
Dependent claims: 7, 8, 9, 10, 17, 20
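Claim 6 conditions the removal of private information on preference information for an anonymity level, but the claim text does not say how a level maps to concrete behavior. One possible reading, shown here purely as an assumption, treats the level as a threshold `k`: a category is kept only if at least `k` distinct users contributed to it, rarer categories are folded into a generic `"other"` bucket, and user identifiers are dropped. The function name and the `(user_id, category)` event shape are hypothetical.

```python
from collections import defaultdict

def apply_anonymity_level(events, k):
    """Fold categories seen from fewer than k distinct users into "other".

    events is a list of (user_id, category) pairs; the returned list holds
    category labels only, so user identifiers never leave this function.
    """
    users_per_category = defaultdict(set)
    for user, category in events:
        users_per_category[category].add(user)
    return [
        category if len(users_per_category[category]) >= k else "other"
        for _, category in events
    ]

events = [
    ("u1", "news"), ("u2", "news"), ("u3", "news"),
    ("u1", "rare-forum"),
    ("u2", "sports"), ("u3", "sports"),
]
low = apply_anonymity_level(events, 1)   # keeps every category
high = apply_anonymity_level(events, 3)  # keeps only widely shared categories
```

A higher `k` hides more of the rare, potentially identifying behavior at the cost of statistical detail, which is the trade-off that the later preservation check has to weigh.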
11. A computer-readable medium for storing instructions, the instructions comprising:
  one or more instructions that, when executed by one or more processors of a device for providing synthetic data, when statistical properties and privacy information, associated with web data, are preserved in the synthetic data, cause the one or more processors to:
    receive, from user devices, the web data,
      the web data being generated based on interactions of the user devices with one or more content provider devices via a network,
      the web data including private information regarding at least one of the user devices or one or more users of the user devices, and
      the web data including at least one of:
        clickstream data that includes information associated with portions of content, provided by the one or more content provider devices, that are selected via the user devices,
        location data that includes information associated with locations of the user devices when the content is accessed by the user devices,
        time data that includes information associated with times when the user devices access the content, or
        network data that includes information associated with network resources utilized by the user devices to access the content;
    remove erroneous or objectionable web data from the web data to generate a subset of the web data;
    categorize the subset of the web data by assigning categories to the subset of the web data;
    perform an empirical estimation of the categorized subset of the web data to generate empirical estimations that include information that provides a representation of behaviors associated with the one or more users of the user devices;
    receive a selection of an anonymity preference associated with generating synthetic data;
    perform a simulation of the empirical estimations to generate the synthetic data,
      the synthetic data including information associated with the empirical estimations, and
      the synthetic data removing the private information from the web data in accordance with the anonymity preference;
    determine whether the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data; and
    selectively store the synthetic data in a storage device and provide the synthetic data when the statistical properties and the privacy information, associated with the web data, are preserved in the synthetic data.
Dependent claims: 12, 13, 14, 15, 18
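The claims describe empirical estimations that provide "a representation of behaviors" and a simulation driven by those estimations, without fixing a model. As one hedged illustration for clickstream data, the sketch below assumes a first-order Markov chain: transition probabilities between categories are estimated from observed sessions, and a synthetic session is produced by a random walk over them. The Markov model, the function names, and the sample sessions are assumptions made for this example only.

```python
import random
from collections import Counter, defaultdict

def estimate_transitions(sessions):
    """Estimate P(next category | current category) from observed sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

def simulate_session(transitions, start, length, seed=0):
    """Random walk over the estimated transitions, starting at `start`.

    Stops early when the walk reaches a category with no observed successor.
    """
    rng = random.Random(seed)
    path = [start]
    while len(path) < length and path[-1] in transitions:
        options = transitions[path[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        path.append(nxt)
    return path

# Hypothetical categorized clickstream sessions, one list per user visit.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "shop"],
    ["home", "shop"],
]
transitions = estimate_transitions(sessions)
synthetic_session = simulate_session(transitions, "home", 5)
```

Because the walk only follows transitions that occurred in the input, every step of a synthetic session is behavior that was actually observed in aggregate, while no single real session is copied by construction.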
Specification