1. # Distributed Metrics for Conversion Model Evaluation

At Magnetic we use logistic regression, via Vowpal Wabbit, to estimate the probability that a given impression results in a click or a conversion. To decide which variables to include in our models, we need objective metrics that tell us whether we are doing a good job. Of these metrics, only the computation of lift quality (in its exact form) is not easily parallelizable. In this post, I will show how the computation of lift quality can be reordered to make it distributable.

2. # Computing Distributed Groupwise Cumulative Sums in PySpark

When we work on modeling projects, we often need to compute the cumulative sum of a given quantity. At Magnetic, we are especially interested in making sure that our advertising campaigns spend their daily budgets evenly throughout the day. To do this, we compute cumulative sums of dollars spent throughout the day in order to identify the moment at which a given campaign has delivered half of its daily budget. Another example where computing a cumulative sum comes in handy is transforming a probability density function into a cumulative distribution function.

Because we deal with large quantities of data, we need to be able to compute cumulative sums in a distributed fashion. Unfortunately, most of the algorithms described in online resources do not work well when groups are either large (in which case we can run out of memory) or unevenly distributed (in which case the largest group becomes the bottleneck).
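The core trick behind a distributed cumulative sum can be sketched in plain Python: each partition computes its local total in parallel, the driver turns those (small) totals into per-partition offsets with an exclusive prefix sum, and a second parallel pass adds each partition's offset to its local cumulative sum. The `distributed_cumsum` function and the toy partitions below are illustrative, not the post's actual PySpark code:

```python
from itertools import accumulate

def distributed_cumsum(partitions):
    """Two-pass cumulative sum over a list of 'partitions'.

    Pass 1 (parallelizable): each partition reports its total.
    Driver step: an exclusive prefix sum over those totals gives
    each partition the amount accumulated before it starts.
    Pass 2 (parallelizable): each partition computes a local
    cumulative sum shifted by its offset.
    """
    # Pass 1: one small number per partition.
    totals = [sum(part) for part in partitions]
    # Exclusive prefix sum of the totals -> per-partition offsets.
    offsets = [0] + list(accumulate(totals))[:-1]
    # Pass 2: local cumulative sum plus the partition's offset.
    return [
        [offset + c for c in accumulate(part)]
        for part, offset in zip(partitions, offsets)
    ]

# Dollars spent, split across three partitions of a day:
parts = [[10, 20], [5], [1, 2, 3]]
print(distributed_cumsum(parts))  # [[10, 30], [35], [36, 38, 41]]
```

Only the per-partition totals travel to the driver, so memory use no longer depends on the size of any one group.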

3. # PySpark Carpentry: How to Launch a PySpark Job with Yarn-cluster

Using PySpark to process large amounts of data in a distributed fashion is a great way to gain business insights. However, the machine from which tasks are launched can quickly become overwhelmed. This article will show you how to run PySpark jobs so that the Spark driver runs on the cluster, rather than on the submission node.
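The submission itself comes down to a `spark-submit` invocation in yarn-cluster mode, which places the driver on a YARN container instead of the gateway host. A minimal sketch (the job name, dependency archive, and executor count below are illustrative, not from the post):

```shell
# Run the driver inside YARN rather than on the submission node.
# In Spark 1.x this is selected with the yarn-cluster master.
spark-submit \
    --master yarn-cluster \
    --py-files deps.zip \
    --num-executors 10 \
    my_job.py
```

Because the driver runs remotely in this mode, its stdout ends up in the YARN application logs rather than in your terminal.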

4. # Bloom Filter-Assisted Joins with PySpark

One of the most attractive features of Spark is the fine-grained control over what you can broadcast to every executor with very simple code. When I first studied broadcast variables, my thought process centered around map-side joins and other obvious candidates. I've since expanded my understanding of just how much flexibility broadcast variables can offer.
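One such use is a Bloom-filter-assisted join: build a Bloom filter from the small side's keys, broadcast it (in Spark, via `sc.broadcast`), and use it to discard most non-matching rows from the large side before the shuffle. A minimal pure-Python sketch of the idea; the `BloomFilter` class and the sample data are illustrative, not the post's implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for illustration (not production-tuned)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # No false negatives; false positives possible but rare.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the small side's keys; in PySpark this
# object would be shared with executors as a broadcast variable.
small_keys = {"user_1", "user_7"}
bf = BloomFilter()
for k in small_keys:
    bf.add(k)

# Pre-filter the large side before the actual join.
large_side = [("user_1", 10), ("user_2", 20), ("user_7", 30)]
candidates = [row for row in large_side if bf.might_contain(row[0])]
```

Because a Bloom filter never yields false negatives, the subsequent join over `candidates` still produces every matching row; the few false positives that slip through are simply dropped by the join itself.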

5. # Installing Spark 1.5 on CDH 5.4

If you have not tried processing data with Spark yet, you should. It is a fast-moving framework that can process data up to 100x faster than Hadoop MapReduce for in-memory workloads, while leveraging existing Hadoop components (HDFS and YARN). Since Spark is evolving rapidly, in most cases you will want to run the latest version released by the Spark community, rather than the version packaged with your Hadoop distribution. This guide will walk you through what it takes to get the latest version of Spark running on your cluster.