Big Data with Apache Spark

written by Sean Law and Benjamin Zaitlen on 2018-03-13

This talk will introduce Apache Spark (one of the most popular big data tools), its different built-ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of Spark's limitations and the work being undertaken to address them. We'll also look at the cases where using Spark is more like trying to drive a screw with a hammer. Since we want to finish on a happy note, we'll close by looking at the new vectorized UDFs in PySpark 2.3.
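For a taste of the wordcount example mentioned above, here is a minimal sketch of the logic. The pure-Python function below mirrors the flatMap/map/reduceByKey steps, and the equivalent PySpark pipeline is sketched in a comment (it assumes a SparkContext named `sc` and a placeholder input path):

```python
from collections import Counter

def wordcount(lines):
    # flatMap: split every line into individual words
    words = [word for line in lines for word in line.split()]
    # map + reduceByKey: tally occurrences of each word
    return dict(Counter(words))

# The same pipeline in PySpark (sketch; assumes a SparkContext `sc`):
#   counts = (sc.textFile("input.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda word: (word, 1))
#               .reduceByKey(lambda a, b: a + b))

print(wordcount(["to be or not to be"]))
```

The local version fits in memory; Spark's value is that the same three transformations run unchanged across a cluster when the input doesn't.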

If you have other Spark questions, even if they aren’t closely related to the talk, feel free to bring them along and we can spend the remaining time on informal Q&A.


Holden Karau is a transgender Canadian open source developer advocate @Google with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer and PMC member on Apache Spark and a committer on the Apache SystemML and Mahout projects. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.