Articles tagged big data

  1. Bloom Filter-Assisted Joins with PySpark

    One of the most attractive features of Spark is the fine-grained control over what you can broadcast to every executor with very simple code. When I first studied broadcast variables, my thinking centered on map-side joins and other obvious candidates. I’ve since expanded my understanding of just how much flexibility broadcast variables can offer.
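
    The core idea the title hints at can be sketched in plain Python: build a Bloom filter over the keys of the small side of a join, ship that compact structure to every worker (in Spark, via a broadcast variable), and use it to cheaply discard rows that cannot possibly match before the expensive join. The `BloomFilter` class below is a hypothetical minimal implementation for illustration, not the article's actual PySpark code:

    ```python
    import hashlib
    import math

    class BloomFilter:
        """Minimal Bloom filter: a compact, probabilistic set-membership test.

        Guarantees no false negatives; false positives occur at roughly
        `error_rate` when loaded with `capacity` items.
        """

        def __init__(self, capacity, error_rate=0.01):
            # Standard sizing formulas for bit-array length and hash count.
            self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
            self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
            self.bits = bytearray((self.num_bits + 7) // 8)

        def _positions(self, item):
            # Double hashing: derive k bit positions from two base hashes.
            digest = hashlib.sha256(str(item).encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    # Build the filter from the small side's join keys...
    small_keys = {"a", "b", "c"}
    bf = BloomFilter(capacity=len(small_keys))
    for key in small_keys:
        bf.add(key)

    # ...then, on each executor, prune the large side before joining.
    # All true matches survive; only an occasional false positive slips through.
    big_rows = [("a", 1), ("x", 2), ("b", 3), ("y", 4)]
    candidates = [row for row in big_rows if bf.might_contain(row[0])]
    ```

    In real PySpark code the serialized filter would be wrapped with `sc.broadcast(...)` so each executor holds one copy, rather than shipping it with every task.
    
    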

  2. Embarrassingly Serial

    The past decade has seen a surge in technologies around “big data,” claiming to make it easy to process large data sets quickly, or at least scalably, by distributing work across a cluster of machines. This is not a story of success with a big data framework. This is a story of a small data set suffering at the hands of big data assumptions, and a warning to developers: check what your big data tools are actually doing for you.

  3. One-Pass Distributed Random Sampling

    One of the important factors affecting the efficiency of our predictive models is their recency. The earlier our bidders get a new version of the prediction model, the better decisions they can make. Delays in producing the model result in lost money due to incorrect predictions.

    The slowest steps in our modeling pipeline are those that require manipulating the full data set (multiple weeks’ worth of data). Our sampling process has historically required two full passes over the data set, and so was an obvious target for optimization.
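
    The classic building block for sampling in a single pass over a stream of unknown length is reservoir sampling (Algorithm R). The distributed variant the article describes is not reproduced here; this is a minimal single-machine sketch of the underlying technique:

    ```python
    import random

    def reservoir_sample(stream, k, rng=None):
        """Uniform random sample of k items from a stream, in one pass.

        Algorithm R: fill the reservoir with the first k items, then keep
        item i (0-indexed) with probability k / (i + 1), evicting a
        uniformly chosen current occupant.
        """
        rng = rng or random.Random()
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # One pass over 100,000 items, never holding more than 5 in memory.
    sample = reservoir_sample(range(100_000), 5, rng=random.Random(42))
    ```

    Because each item is examined exactly once and the reservoir is bounded at k items, the memory footprint is independent of the data set size, which is what makes a one-pass approach attractive when the input is weeks of data.
    
    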