Articles by Thomas Gauthier

  1. Distributed Metrics for Conversion Model Evaluation

At Magnetic we use logistic regression, via Vowpal Wabbit, to estimate the probability that a given impression will result in a click or a conversion. To decide which variables to include in our models, we need objective metrics that tell us whether we are doing a good job. Of these metrics, only the computation of lift quality (in its exact form) is not easily parallelizable. In this post, I will show how the computation of lift quality can be reordered to make it distributable.
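To fix ideas, here is a naive single-machine sketch of a lift-quality metric: the area between the model's cumulative-gains curve and the random baseline, normalized by the same area for a perfect ranking. This is one common formulation, assumed for illustration; it is not necessarily the exact metric the article reorders.

```python
def lift_quality(scores, labels):
    """Naive, single-machine lift quality (one common formulation,
    assumed for illustration). Returns 1.0 for a perfect ranking,
    ~0 for a random one, negative for a worse-than-random one."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total = sum(l for _, l in pairs)  # total positives (clicks/conversions)
    n = len(pairs)

    def gains_area(ordered_labels):
        # Area under the normalized cumulative-gains curve.
        cum, area = 0, 0.0
        for l in ordered_labels:
            cum += l
            area += cum / total
        return area / n

    model = gains_area([l for _, l in pairs])            # model's ranking
    ideal = gains_area(sorted((l for _, l in pairs), reverse=True))
    random_baseline = (n + 1) / (2 * n)                  # random ranking
    return (model - random_baseline) / (ideal - random_baseline)
```

The exact form is hard to parallelize because the cumulative sum over the globally sorted scores couples every record to every record ranked above it, which is precisely the dependency the article's reordering removes.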

  2. Computing Distributed Groupwise Cumulative Sums in PySpark

When we work on modeling projects, we often need to compute the cumulative sum of a given quantity. At Magnetic, we are especially interested in making sure that our advertising campaigns spend their daily budgets evenly throughout the day. To do this, we compute cumulative sums of dollars spent throughout the day in order to identify the moment at which a given campaign has delivered half of its daily budget. Another example where a cumulative sum comes in handy is transforming a probability density function into a cumulative distribution function.

Because we deal with large quantities of data, we need to compute cumulative sums in a distributed fashion. Unfortunately, most of the algorithms described in online resources do not work well when groups are either large (in which case we can run out of memory) or unevenly distributed (in which case the largest group becomes the bottleneck).
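The core two-pass idea behind a memory-safe distributed cumulative sum can be sketched without Spark. Here partitions are modeled as plain Python lists (a hypothetical stand-in for RDD partitions, not code from the article): compute each partition's total, prefix-sum those totals into per-partition offsets, then do a purely local cumulative sum shifted by the offset, so no partition ever needs to hold another partition's data.

```python
def distributed_cumsum(partitions):
    """Two-pass cumulative sum over sorted 'partitions' (lists of
    numbers), mimicking the pattern used on Spark. Pass 1: per-partition
    totals (cheap, one number per partition). Pass 2: local running sums
    seeded with the prefix-summed offsets."""
    totals = [sum(p) for p in partitions]
    # Offset for partition i = sum of all totals before it.
    offsets = [0]
    for t in totals[:-1]:
        offsets.append(offsets[-1] + t)
    out = []
    for off, part in zip(offsets, partitions):
        running = off
        row = []
        for x in part:
            running += x
            row.append(running)
        out.append(row)
    return out
```

On Spark the same shape maps onto `mapPartitionsWithIndex` (collect the small per-partition totals, broadcast the offsets, then a second local pass), which is what keeps memory bounded even for very large or skewed groups.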

  3. PySpark Carpentry: How to Launch a PySpark Job with Yarn-cluster

Using PySpark to process large amounts of data in a distributed fashion is a great way to gain business insights. However, the machine from which jobs are launched can quickly become overwhelmed. This article shows how to run PySpark jobs so that the Spark driver runs on the cluster, rather than on the submission node.
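As a preview, the submission typically looks like the following. The flags shown are standard `spark-submit` options (older Spark versions spelled this `--master yarn-cluster`); `deps.zip` and `my_job.py` are hypothetical placeholder names, and resource sizes are illustrative.

```shell
# Run the Spark driver inside the YARN cluster instead of on the submit host.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4G \
  --py-files deps.zip \
  my_job.py
```

In cluster mode the submission node only hands the job off to YARN, so heavy driver-side work (collecting results, building broadcast variables) no longer burdens the machine you launch from.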