Sparkling Pandas - using Apache Spark to scale Pandas

02:15 PM - 03:00 PM on August 16, 2014, Room 702

Holden Karau

Audience level:
intermediate
Watch:
http://youtu.be/AcyI_V8FeIU

Description

Co-presented with Juliet Hougland.

Pandas is a fast and expressive library for data analysis that doesn’t naturally scale to more data than can fit in memory. PySpark is the Python API for Apache Spark that is designed to scale to huge amounts of data but lacks the natural expressiveness of Pandas. We will introduce Sparkling Pandas, a new library that brings together the best features of Pandas and PySpark: expressiveness, speed, and scalability.

Abstract

Sparkling Pandas (Pandas on PySpark)

Spark frames collections of data as Resilient Distributed Datasets (RDDs), which have a simple functional API. To use Pandas DataFrames in PySpark without Sparkling Pandas, you could represent each partition of your RDD as a DataFrame and then use the functional API for RDDs to perform operations. The resulting syntax, rdd.map(lambda x: x.map(lambda y: z)), leaves something to be desired.
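The per-partition pattern described above can be sketched without a cluster by treating an RDD as a plain Python list of partitions, each holding a Pandas DataFrame. This is an illustration of the pattern, not the Sparkling Pandas API:

```python
import pandas as pd

# Simulate an RDD's partitions: each partition holds one Pandas DataFrame.
partitions = [
    pd.DataFrame({"x": [1, 2], "y": [10, 20]}),
    pd.DataFrame({"x": [3, 4], "y": [30, 40]}),
]

# Without a wrapper library, every operation needs two levels of mapping:
# an outer map over partitions and an inner operation on each DataFrame.
doubled = [df.apply(lambda col: col * 2) for df in partitions]

# Against a real PySpark RDD this would read roughly as
#   rdd.map(lambda df: df.apply(lambda col: col * 2))
# which is the clumsy nested-map syntax the talk calls out.
print(doubled[0]["x"].tolist())  # [2, 4]
```

The outer map never goes away, so every DataFrame operation must be wrapped in a lambda before it can run on the cluster.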

Sparkling Pandas lets you treat the RDD itself (almost) as a DataFrame. We do this by extending the basic Spark RDD to be aware of the underlying Pandas DataFrames and implementing useful functionality on this specialized RDD, starting with some simple operations. We will examine some common methods and how their performance differs in a distributed setup. We will also cover how to load data effectively into Pandas, using both Spark SQL and Spark's file-loading mechanism combined with Pandas' native parsing. This gives us the ability to load data from various file formats, like CSV and Parquet, as well as from Apache Hive.

Time permitting, we will then cover some of the implementation details, since we feel they help explain the performance improvements Sparkling Pandas provides. We believe this will increase the chances we don’t have to write the rest of the library by ourselves. Join us!