This week I’ve fallen down a rabbit hole learning about Spark. Coincidentally, I’ve also been reading _The Design of Everyday Things_, and I’ve found Spark really engaging because of a few interesting design decisions. The “on ramp” to Spark is super easy: 1) read in data, 2) query data. This is possible because Spark is declarative: you tell it what you want, and it figures out how. Spark has multiple APIs at different levels of “granularity”. The easy on ramp is the DataFrame API. Treat a Spark table as a Pandas data frame, an R data frame, or a SQL table and off you go, it “just works”. And there are lots of “hints” and levers to pull which can improve performance even within the DataFrame API (see the sketch below). So SQL, R, Python, or Scala act as the meta-language, offering up familiar objects to manipulate, and one need not know the details of map/reduce.
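To make that concrete, here’s a minimal sketch of that DataFrame on ramp in PySpark (just one of the several front ends mentioned above); the file paths, column names, and the small “regions” lookup table are all hypothetical, purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-on-ramp").getOrCreate()

# 1) read in data (hypothetical Parquet file of sales records)
sales = spark.read.parquet("/data/sales.parquet")

# 2) query data, declaratively: describe *what* you want and let
#    Spark's optimizer figure out *how* to compute it
summary = (
    sales
    .filter(F.col("year") == 2019)
    .groupBy("region")
    .agg(F.sum("revenue").alias("total_revenue"))
    .orderBy(F.desc("total_revenue"))
)
summary.show()

# One of the "levers": hint that a small lookup table should be broadcast
# to every executor instead of shuffled in the join (regions is hypothetical)
regions = spark.read.parquet("/data/regions.parquet")
joined = sales.join(F.broadcast(regions), on="region")
```

Same cluster, same data, but the query reads almost like Pandas or dplyr, which is exactly the gentle on ramp described above.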
But there’s a lower-level API that gives more control. That was the original Spark API, though I suspect only a small percentage of users ever touch it. It allows direct manipulation of Spark’s data objects, Resilient Distributed Datasets (RDDs), so some call it the RDD API. Having multiple ways to use a product is really cool.
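For contrast, here’s a minimal sketch of that lower-level style, assuming the same SparkSession as in the sketch above; the log file path and the word-count task are hypothetical, just to show the hand-written map/reduce steps.

```python
# Drop down from the SparkSession to its underlying SparkContext
sc = spark.sparkContext

# Hypothetical text file; with RDDs you spell out the map and reduce steps yourself
lines = sc.textFile("/data/app.log")
word_counts = (
    lines
    .flatMap(lambda line: line.split())    # map each line to its words
    .map(lambda word: (word, 1))           # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)       # reduce: sum the counts per word
)
print(word_counts.take(5))
```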
This allows lots of experimentation and lots of “quick success” as one learns. Honestly, it feels like accidental gamification: “I have stuff that works (yeah!) but I *think* I can make it 50% faster. Let’s play this level again!” Iteration keeps my attention.