Mon 17 Aug 2015

Measuring Statistical Lift on Search Categories

One of the most popular features of the Magnetic Insight platform is our category rankings for an advertiser’s audience of page visitors. The rankings give a completely unbiased look into which search categories are the most popular amongst the users that visit a customer’s different web pages.

A category lift report from Magnetic Insight

This report, the centerpiece of Magnetic Insight, shows interest and intent for our customers’ site visitors based on their off-site search activity and the keyword categories we programmatically assign to each search. Among our advertisers, there is a lot of curiosity about how these numbers are calculated and why they change day by day. This post digs into the gritty implementation details of both the math behind this analysis and the data processing approaches we use to handle the data volume.

So how do we go from raw search event and page view data to the dolled-up category indices you see above?

Statistical Lift

The per-category results are based on statistical lift. In broad terms, lift describes how much more or less likely you are to take a target action given that you have taken a specific preceding action.

Expressed mathematically as a conditional probability, the lift of target action A given precondition B is:

Lift(A|B) = P(A|B) / P(A)

For a simple example, suppose 20% of people wear mittens, but 60% of children wear mittens. In this scenario, the lift is 3: being a child increases the chance one wears mittens by a factor of 3.

Lift(mittens|child) = 0.6 / 0.2 = 3

In our case, we’re interested in the lift of a page visit given that the end user has a search in a given category, or:

Lift(visit|category) = P(visit|category) / P(visit)

Lift is symmetric in its two arguments (both orderings reduce to P(visit ∩ category) / (P(visit) · P(category))), so this is the same number as P(category|visit) / P(category): how much more likely a site’s visitors are to have searched in the category than the average user.

In other words, instead of mittens and children, we’re looking at page visitors and users who’ve searched in a particular category. To extend the example, if 20% of the users overall have searches in the “Automotive” category, but 60% of the visitors to a particular advertiser site have automotive-related searches in their history, we again end up with a lift factor of 3. We interpret this by saying that “visitors to this site are 3 times more likely than the average user to have searched for automotive-related terms.”

Calculating Lift at Magnetic

To start, we look at 30 days’ worth of search event data and page views. For the search events, we filter the input down to unique combinations of cookie and category. We perform the same action on page views, except that instead of unique combinations of cookie and category, we look at unique combinations of cookie and advertiser site. We compute distinct combinations because the lift formula is phrased in terms of unique visitors, so a full history, including duplicates, for each user is not needed.

Next we have to compute the cardinalities of the different sets, in order to compute the probabilities on which the lift computation is built:

- all_users: the number of distinct cookies seen across all search events and page views
- c_users: the number of distinct cookies with at least one search in category c
- v_users: the number of distinct cookies with at least one page view on advertiser site v
- vc_users: the number of distinct cookies appearing in both of the previous two sets

This can be visualized by:

Venn diagram visualization of the different sets of users
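
To make these inputs concrete, here is a minimal in-memory sketch of the four counts for a single site v and category c (the function name and shape are ours for illustration; the real computation runs as a distributed job, as described later):

def cardinalities(search_events, page_views, category, site):
    # search_events: iterable of (cookie, category) pairs
    # page_views: iterable of (cookie, site) pairs
    # Set comprehensions give the distinct-combination semantics described above.
    c_cookies = {cookie for cookie, cat in search_events if cat == category}
    v_cookies = {cookie for cookie, s in page_views if s == site}
    all_cookies = ({cookie for cookie, _ in search_events} |
                   {cookie for cookie, _ in page_views})
    return len(c_cookies & v_cookies), len(c_cookies), len(v_cookies), len(all_cookies)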

We’ve found through trial and error that we only see statistically relevant results for advertiser + category combinations where vc_users >= 5. So for a particular category c if vc_users < 5 we filter out the record.

For a given site v and category c, the lift can be computed as:

Lift(v|c) = (vc_users / c_users) / (v_users / all_users)

which can be rewritten as:

Lift(v|c) = (all_users * vc_users) / (c_users * v_users)

(By the symmetry noted earlier, this equals (vc_users / v_users) / (c_users / all_users), so it makes no difference whether we condition on the visit or on the category.)

This is referred to as the maximum likelihood estimate (MLE). It is an easy first step to compute; however, it can be inaccurate if the behavior and/or the condition are rare.
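
As a quick illustration (the helper name here is just for exposition), the point estimate is a one-liner:

def lift_mle(vc_users, c_users, v_users, all_users):
    # Lift(v|c) = (all_users * vc_users) / (c_users * v_users)
    return (float(all_users) * vc_users) / (float(c_users) * v_users)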

Moreover, since we use the lift to quantify the strength of the relationship between the condition and the behavior, we should be worried about Type I errors, i.e., false positives: random flukes which do not reflect the underlying causality and which would disappear given more observations.

This means that usually we are not interested in the most precise possible estimate of the lift, the MLE, but rather in a conservative lower estimate. In other words, instead of saying that the lift is precisely 3, we want to say that we are 95% sure that the lift is more than 2, for instance. Therefore we should be looking at the confidence interval for the probability estimate instead of the maximum likelihood estimate above.

Specifically, we use the Wilson score interval, which is known to be accurate even when the number of trials is very small.

So, to redefine lift in terms of a safer probabilistic measure, we take the ratio of the lower bound of the numerator’s interval to the upper bound of the denominator’s interval (and the reverse for an upper bound on the lift), using 95% confidence intervals.

The code below shows how we compute both the lower and upper lift estimates in terms of our four inputs for a given pixel and category.

def lift_lower_bound(vc_users, c_users, v_users, all_users):
    # Interval for P(v|c): the share of category searchers who visit the site.
    (c1, w1) = wilson(vc_users, c_users)
    # Interval for P(v): the share of all users who visit the site.
    (c2, w2) = wilson(v_users, all_users)
    # Pessimistic ratio: smallest plausible numerator over largest plausible denominator.
    return (c1 - w1) / (c2 + w2)

def lift_upper_bound(vc_users, c_users, v_users, all_users):
    (c1, w1) = wilson(vc_users, c_users)
    (c2, w2) = wilson(v_users, all_users)
    # Optimistic ratio: largest plausible numerator over smallest plausible denominator.
    return (c1 + w1) / (c2 - w2)
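
The wilson helper takes a number of successes and a number of trials; a minimal sketch of it, assuming it returns the Wilson score interval’s center and half-width at 95% confidence, is:

import math

def wilson(k, n, z=1.96):
    # Wilson score interval for k successes out of n trials.
    # Returns (center, half_width); the interval is center +/- half_width.
    p = k / float(n)
    denominator = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denominator
    half_width = (z / denominator) * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return (center, half_width)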

We just have one final transformation to perform now that we have our measurement as a pair of lower and upper estimates. If both estimates are > 1, we record the lower one as the lift metric. If both are <= 1, we use the upper one. If the upper estimate is > 1 and the lower is <= 1, we flag the lift metric for that particular combination as unknown. This is how we keep the reported lift scores conservative.
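
A sketch of this decision rule in the same terms (the function name is illustrative):

def conservative_lift(vc_users, c_users, v_users, all_users):
    lo = lift_lower_bound(vc_users, c_users, v_users, all_users)
    hi = lift_upper_bound(vc_users, c_users, v_users, all_users)
    if lo > 1 and hi > 1:
        return lo   # confidently above 1: report the conservative lower bound
    if lo <= 1 and hi <= 1:
        return hi   # confidently at or below 1: report the conservative upper bound
    return None     # interval straddles 1: flag this combination as unknown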

Scaling Lift Computation

Although the mathematics is relatively simple, computing lift over billions of search and page visit events is challenging simply due to the volume of data.

We have experimented with both Pig and Redshift for joining around 8 billion search events and 700 million page views by cookie in order to determine vc_users for every combination. Joining on cookie and performing a distinct operation against the resulting relation takes more than an hour in Pig and is computed once a day. Redshift was significantly faster, taking only about 8 minutes even with unsorted tables, but it added significant overhead in managing the uptime and downtime of the remote cluster to keep operating costs down.

One obvious route for optimizing the lift calculation is to use approximate cardinalities for the sets involved. In particular, we are investigating the HyperLogLog data structure, which approximates the count of distinct elements in a set with a small error rate. The lift analysis is tolerant of error in the various cardinality measurements: if v_users, c_users, or vc_users is off by 1-2%, it won’t have too dramatic an effect on the final index.

The tricky part is the intersection, vc_users. HyperLogLogs can be combined to estimate the cardinality of the union of two sets, but not of their intersection. Techniques exist for estimating the size of the intersection, notably the inclusion-exclusion formula, but our data may not satisfy its requirements: inclusion-exclusion becomes unreliable when the sets being intersected differ in cardinality by orders of magnitude, as ours often do.
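
For illustration, an inclusion-exclusion estimate over two HyperLogLog sketches might look like the following (the HLL interface here, with copy, merge, and cardinality methods, is hypothetical; real libraries differ):

def estimate_vc_users(hll_v, hll_c):
    # HLL sketches merge losslessly under union, so |V ∪ C| is cheap to estimate.
    union = hll_v.copy()
    union.merge(hll_c)
    # Inclusion-exclusion: |V ∩ C| = |V| + |C| - |V ∪ C|.
    # The relative error blows up when |V| and |C| differ by orders of magnitude,
    # because the union's absolute error can swamp the small intersection.
    return hll_v.cardinality() + hll_c.cardinality() - union.cardinality()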

If we find the error percentage is too high for our intersections, then we plan to implement the more robust, though more complex, minhash technique described by AdRoll.

Tags: data science, keyword categorization
