# Click Prediction with Vowpal Wabbit

At the core of our automated campaign optimization algorithms lies a difficult problem: predicting the outcome of an event before it happens. With a good predictor, we can craft algorithms to maximize campaign performance, minimize campaign cost, or balance the two in some way. Without a good predictor, all we can do is hope for the best.

# The Problem

Before we can talk about how we build a predictor, it will be instructive to discuss the problem space we’re operating in. Our real-time bidding servers consume hundreds of thousands of bid requests per second, each of which tells us there is an opportunity to show an ad to a particular user on a particular page.

To decide which ad (if any) to show, we need to consider all of the campaigns we’re running on behalf of our clients, which explodes the cardinality of decisions by several orders of magnitude. If we decide we do want to show an ad, we return a bid response indicating the ad and how much we’d like to pay. If we’re the highest bidder, we win.

Our predictor, then, needs to be incredibly fast, and only has access to the information we have at bid time — things like the user agent string, IP address (from which we can infer geographic location), and additional data we know about the user based on a cookie lookup in a database, etc. We also mix in information about the campaign (since the click-through rate varies widely from one campaign to another, depending on the popularity of the product being advertised, the quality of the ads, etc).

With all this information, the predictor then has to answer a (seemingly) simple question: what is the estimated click probability of this impression, if we bid on it and win?

# Model Training

We researched the modeling libraries available to us, and decided to use Vowpal Wabbit to build our models. VW is not actually a single machine learning algorithm, but a package of different modeling algorithms that share a common interface for training and evaluating new examples. Thus, training is a problem of selecting a model type and model parameters, and iterating on these decisions until we achieve a satisfactory resulting model. VW also offers a C library interface which lets us use the model seamlessly from our real-time services, which are written in Python.

We chose to use a logistic regression model, using some of the features described above, as well as a variety of “quadratic” features (feature pairs that VW internally manages) to train the model. We selected quadratic features based both on information gain ratio and the gain in performance of the model in testing scenarios.
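As an illustration, VW consumes plain-text examples with features grouped into namespaces, and quadratic features are requested at training time by crossing two namespaces with the `-q` flag. The namespaces and feature names below are hypothetical, not our production schema:

```
# Logistic loss uses -1/+1 labels; features are grouped into
# namespaces ("u" = user features, "c" = campaign features)
-1 |u os_ios geo_US hour_14 |c campaign_123 ad_456
 1 |u os_android geo_DE hour_09 |c campaign_789 ad_012

# Train a logistic regression model, crossing the "u" and "c"
# namespaces to generate quadratic (pairwise) features
vw train.vw --loss_function logistic -q uc -f model.vw
```

With `-q uc`, VW internally generates a feature for every (user feature, campaign feature) pair, which lets the model learn campaign-specific effects without us materializing those pairs ourselves.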

# Calibration

We have stated earlier that we need to know the probability of a click in an auction scenario. Unfortunately, the scores produced by VW look nothing like probabilities, as they range from -50 to 50. If the model has learned well, events with higher scores are more likely to result in a click than events with lower scores, but the absolute values of the scores are not, by themselves, meaningful. Calibration is the process by which we turn the raw scores into a more useful form — the probability of a click.

Our calibration function must have the following properties for the best possible outcomes when we use the model:

- Monotonicity: events with higher raw scores should have higher calibrated probabilities
- Continuity: events with similar raw scores should have similar predicted probabilities

We considered several candidate functions for calibrating the raw scores into probabilities:

- The sigmoid function creates a smooth S-shaped curve that satisfies both conditions, but it offers no control over the probabilities it outputs.
- Logistic regression creates a function similar in shape to the sigmoid function, but possibly shifted and squished. Unfortunately, it has too few degrees of freedom and does not capture the variations in click probability well enough.
- Isotonic regression creates a step function that performs well, but is neither monotonically increasing nor continuous, and common implementations are exceedingly slow to train over millions of examples.

*Logistic and Isotonic Regression vs. Actual CTR. Isotonic regression produces a good fit, but is slow to calculate and is not monotonically increasing. Logistic regression doesn’t sufficiently capture the noisy nature of CTR data.*

Thus, we developed our own calibration approach, “MIMIC” (Multi-Interval Monotonically Increasing Continuous) Calibration. MIMIC is similar to isotonic regression, in that it creates a piecewise function. But unlike isotonic regression, it is both monotonically increasing and continuous. We achieve this by dividing the input calibration examples into contiguous ranges by score, and fitting a line segment to each range such that adjacent segments meet at the range boundaries:

- For all training examples, compute the uncalibrated score using VW, and note whether the example was a click or a non-click.
- Sort the training examples by raw score in increasing order.
- Create groups of consecutive examples over the sorted set such that each group contains exactly five clicks, and compute the CTR for each group.
- Note that because we only constrain the number of clicks in each group, it’s possible that the CTR of adjacent groups at this point is not monotonically increasing. So, for as long as this condition holds, merge adjacent groups where the left-hand group’s CTR is higher than the right-hand group’s. This may take several passes over the groups, and is O(N²) in the worst case, but N here is significantly smaller than the number of training examples. In practice, we find that this reduces the number of groups from an initial set of a few thousand to just a few hundred.
- For each range boundary, compute the (X, Y) point that will become part of our final function, where X is the raw score at the boundary, and Y is the mean CTR of the groups on either side of the boundary.
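The steps above can be sketched in a few dozen lines of Python. This is an illustrative implementation, not our production code: the clicks-per-group default and the choice of the midpoint between neighbouring groups as the boundary score are assumptions.

```python
def build_mimic(examples, clicks_per_group=5):
    """Build MIMIC calibration points from (raw_score, clicked) pairs.

    Returns parallel lists (xs, ys): boundary raw scores and the
    calibrated click probability at each boundary.
    """
    # Sort the training examples by raw score in increasing order.
    examples = sorted(examples)

    # Group consecutive examples so that each group contains exactly
    # `clicks_per_group` clicks (the final group may contain fewer).
    groups = []  # each group: [lo_score, hi_score, n_examples, n_clicks]
    cur = [examples[0][0], examples[0][0], 0, 0]
    for score, clicked in examples:
        if cur[3] == clicks_per_group:
            groups.append(cur)
            cur = [score, score, 0, 0]
        cur[1] = score
        cur[2] += 1
        cur[3] += clicked
    groups.append(cur)

    def ctr(group):
        return group[3] / float(group[2])

    # Merge adjacent groups until CTR strictly increases left to right.
    # Each pass folds a group into its left neighbour when the
    # neighbour's CTR is not lower; repeat until nothing changes.
    merged = True
    while merged:
        merged = False
        out = []
        for g in groups:
            if out and ctr(out[-1]) >= ctr(g):
                left = out[-1]
                out[-1] = [left[0], g[1], left[2] + g[2], left[3] + g[3]]
                merged = True
            else:
                out.append(g)
        groups = out

    # One (X, Y) point per range boundary: X is the boundary score
    # (here, the midpoint between neighbouring groups -- an assumption)
    # and Y is the mean CTR of the groups on either side. The ends of
    # the score range are pinned to the outermost groups' CTRs.
    xs = [groups[0][0]]
    ys = [ctr(groups[0])]
    for left, right in zip(groups, groups[1:]):
        xs.append((left[1] + right[0]) / 2.0)
        ys.append((ctr(left) + ctr(right)) / 2.0)
    xs.append(groups[-1][1])
    ys.append(ctr(groups[-1]))
    return xs, ys
```

Because each merge replaces a non-increasing pair with their pooled CTR, the surviving groups' CTRs strictly increase, so the boundary Y values are monotonically increasing by construction.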

*MIMIC vs. Actual CTR. MIMIC closely captures the true shape of the CTR “curve”, and is monotonically increasing.*

To apply the calibration function given a raw score, look up the group in which that raw score exists, and linearly interpolate across the line segment recorded for that group. We use binary search to do this efficiently.
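That lookup can be sketched with Python’s `bisect` module. The boundary points here are illustrative values, not real calibration output:

```python
import bisect

def calibrate(xs, ys, raw_score):
    """Interpolate a raw VW score into a click probability, given the
    MIMIC boundary points (xs = boundary scores, ys = probabilities)."""
    # Clamp scores that fall outside the calibrated range.
    if raw_score <= xs[0]:
        return ys[0]
    if raw_score >= xs[-1]:
        return ys[-1]
    # Binary search for the segment containing the raw score...
    i = bisect.bisect_right(xs, raw_score)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    # ...and linearly interpolate across it.
    t = (raw_score - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Illustrative boundary points:
xs = [-10.0, 0.0, 10.0]
ys = [0.001, 0.01, 0.05]
```

Since `xs` is sorted, the binary search makes each lookup O(log N) in the number of groups, which matters at bid-time request rates.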

# Quality Metrics

We evaluate our models using Lift Quality, which is similar to the more widely known ROC AUC metric. L-Quality measures how well the model ranks events: if clicks are generally given higher scores than non-clicks, L-Quality will be high; if scores for clicks and non-clicks share the same distribution (e.g. the predictor has learned nothing about the data, and is ranking events randomly), L-Quality will be close to 0.

L-Quality measures only how well the model *sorts* the events in order of click likelihood, achieving a higher quality score when proportionally more of the clicks in the data set are given high scores. It is important to note that L-Quality does not tell us anything about the effectiveness of the calibration model, which is to say, how closely our calibrated model actually predicts reality.

Therefore, we also use Total Variation Distance, which is, essentially, the L1 distance between the predicted click probability and the actual observed outcome (click or no click). This metric helps give us insight into how well MIMIC is performing in the wild.
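Under that definition, the metric reduces to a mean absolute error over events. A minimal sketch (function and argument names are illustrative):

```python
def total_variation_distance(predicted, observed):
    """Mean L1 distance between predicted click probabilities and
    observed outcomes (1 = click, 0 = no click). Lower is better."""
    assert len(predicted) == len(observed)
    return sum(abs(p, y) if False else abs(p - y)
               for p, y in zip(predicted, observed)) / float(len(predicted))
```

A well-calibrated model assigns high probabilities to events that turn out to be clicks and low probabilities to those that don’t, driving this distance down.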

# Using VW from Python

One additional challenge with VW is that it is designed to be used by researchers in an offline setting, but we need to evaluate the model at run-time from our Python bidding application.

Fortunately, VW is written in C++ and exposes a C wrapper layer that we can call from our Python application. Because we use PyPy, and because we don’t want to write C code if we can help it, we implemented the “driver” for VW model evaluation using cffi, a part of the PyPy project.

Writing a cffi wrapper is fairly straightforward. You tell cffi about the bits of the library’s API that you need to use, and how to compile and link with it; it then makes available a Python-level object with methods in place of the functions from the C API which you can use to write your wrapper code. For VW, we use the following:

```
from cffi import FFI

ffi = FFI()

# Declare the functions from the C API we'll use
ffi.cdef("""
void* VW_InitializeA(char *);
void* VW_ReadExampleA(void*, char*);
float VW_Predict(void*, void*);
void VW_FinishExample(void*, void*);
void VW_Finish(void*);
""")

# And write a small C program that cffi can use
# to verify it links successfully
ffilib = ffi.verify("""
typedef short char16_t;
#define bool int
#define true (1)
#define false (0)
#include "vwdll.h"
""",
    include_dirs=["/usr/include/vowpalwabbit/"],
    library_dirs=["/usr/lib64"],
    libraries=["vw_c_wrapper", "vw", "allreduce"],
    ext_package="vw"
)


class VWModel(object):
    def __init__(self, model_file, metadata, vw_cmd):
        # Initialize VW with the given command-line options string
        self._vw = ffilib.VW_InitializeA(vw_cmd)
        if not self._vw:
            raise RuntimeError("VW Model failed to initialize")

    def close(self):
        if self._vw:
            ffilib.VW_Finish(self._vw)
            self._vw = None

    def getScore(self, example_str):
        # Parse the example, score it, then release its resources
        example = ffilib.VW_ReadExampleA(self._vw, example_str)
        score = ffilib.VW_Predict(self._vw, example)
        ffilib.VW_FinishExample(self._vw, example)
        return score
```

With this wrapper, our application code can treat VW as though it were any other Python object, creating, storing, and using instances as necessary.