REAL-TIME-READY BEHAVIORAL TARGETING IN A LARGE-SCALE ADVERTISEMENT SYSTEM
First Claim
1. A device for running temporal queries over large sets of data stored in a distributed file system, comprising:
- a device for receiving one or more data sets stored in a distributed file system, said data sets comprising a plurality of time-stamped events;
a device for specifying one or more temporal queries to be run in a Map-Reduce (M-R) application;
a device for splitting the temporal queries into a set of continuous query (CQ) fragments which are then evaluated to determine a partitioning key;
using the partitioning key for partitioning and distributing the data sets to a cluster comprising a plurality of computing devices in a map phase of the M-R application;
an embedded data stream management system (DSMS) having one or more DSMS-based algorithms for executing a corresponding CQ fragment on a corresponding one of the partitioned data sets, said DSMS algorithms operating on the plurality of computing devices to output one or more events from execution of the CQ fragments on the partitioned data;
wherein output of the DSMS-based algorithms is written to an in-memory blocking queue, from which a reducer phase of the M-R application reads events synchronously relative to the time-stamp associated with each event, regardless of a DSMS output order for the events;
wherein the reducer phase of the M-R framework completes the CQ of the partitioned data by processing the events that are synchronously read from the blocking queue by performing the same computation on each data partition in parallel each of the computing devices; and
a device for outputting the results of the completed CQs to generate the results of the temporal queries.
2 Assignments
0 Petitions
Accused Products
Abstract
A “Real-Time-Ready Analyzer” combines a data stream management system (DSMS) with a map-reduce (M-R) framework to construct a streaming map-reduce framework that is suitable for real-time Behavioral Targeting (BT) (or other temporal queries). The Real-Time-Ready Analyzer allows users to write “dual-intent” temporal analysis queries for BT. These queries are succinct and easy to express, scale well on large-scale offline data, and can also work over real-time data. Further, the Real-Time-Ready Analyzer uses the aforementioned streaming map-reduce framework to provide dual-intent algorithms for end-to-end BT phases. Experiments using real data from an advertisement system show that the Real-Time-Ready Analyzer is very efficient and incurs orders-of-magnitude lower development effort than conventional systems.
130 Citations
20 Claims
-
1. A device for running temporal queries over large sets of data stored in a distributed file system, comprising:
-
a device for receiving one or more data sets stored in a distributed file system, said data sets comprising a plurality of time-stamped events; a device for specifying one or more temporal queries to be run in a Map-Reduce (M-R) application; a device for splitting the temporal queries into a set of continuous query (CQ) fragments which are then evaluated to determine a partitioning key; using the partitioning key for partitioning and distributing the data sets to a cluster comprising a plurality of computing devices in a map phase of the M-R application; an embedded data stream management system (DSMS) having one or more DSMS-based algorithms for executing a corresponding CQ fragment on a corresponding one of the partitioned data sets, said DSMS algorithms operating on the plurality of computing devices to output one or more events from execution of the CQ fragments on the partitioned data; wherein output of the DSMS-based algorithms is written to an in-memory blocking queue, from which a reducer phase of the M-R application reads events synchronously relative to the time-stamp associated with each event, regardless of a DSMS output order for the events; wherein the reducer phase of the M-R framework completes the CQ of the partitioned data by processing the events that are synchronously read from the blocking queue by performing the same computation on each data partition in parallel each of the computing devices; and a device for outputting the results of the completed CQs to generate the results of the temporal queries. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A method for real-time estimation of user-specific click through rates (CTR) for ads, comprising steps for:
-
receiving one or more sets of historical session data derived from one or more groups of users, said historical session data comprising time-stamped click and impression data for a plurality of ads in combination with user-specific features, said features including any of pageviews, URLs and keywords; processing the historical session data to eliminate any data corresponding to users identified as bots to generate a set of bot-free session data; processing the bot-free session data to generate a user behavior profile (UBP) for each user in the bot-free session data; processing the bot-free session data to eliminate statistically insignificant features by performing a statistical hypothesis test that retains only features having a number of associated clicks exceeding a specified minimum click level, and from the retained features, eliminating any features having a statistical association of positive or negative click events that is below a minimum statistical threshold, thereby creating a reduced feature set from the bot-free session data; generating a model of click-through behavior for each ad in the bot-free data for each UBP; for new user session data received in real-time, computing a score for each feature/ad pair based on the bot-free data and the reduced feature set for each new user based on the model of click-through behavior for each ad; estimating a CTR for each ad from the computed score for each ad; and serving one or more ads in real-time to each new user based on the estimated CTR for each ad for each new user. - View Dependent Claims (12, 13, 14)
-
-
15. A computer-readable medium having computer executable instructions stored therein for running continuous queries over large sets of data stored in a distributed file system, said instructions comprising:
-
receiving one or more data sets stored in a distributed file system, said data sets comprising a plurality of time-stamped events; processing the data sets to eliminate any data corresponding to users identified as bots; specifying one or more temporal queries to be run in a Map-Reduce (M-R) application; fragmenting the temporal queries into a set of continuous queries (CQ); determining a partitioning key for each CQ from the data sets for use in partitioning the data sets and distributing the partitioned data sets to a cluster comprising a plurality of computing devices in a map phase of the M-R application; instantiating an embedded data stream management system (DSMS) having one or more DSMS-based algorithms for performing temporal analysis of data sets partitioned in the map phase of the M-R framework, said DSMS algorithms operating on the plurality of computing devices to output one or more events from the temporal analysis of the partitioned data; wherein output of the DSMS-based algorithms is written to an in-memory blocking queue, from which a reducer phase of the M-R application reads events synchronously relative to the time-stamp associated with each event, regardless of a DSMS output order for the events; wherein the reducer phase of the M-R framework completes the CQ of the partitioned data by processing the events that are synchronously read from the blocking queue by performing the same computation on each data partition in parallel each of the computing devices; and outputting the results of the completed CQs to generate the results of the temporal queries. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification