Learning Spark: Lightning-Fast Big Data Analysis

The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open source cluster computing system that makes data analytics fast to run and fast to write. You’ll learn how to run programs faster, using primitives for in-memory cluster computing. With Spark, your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

Written by the developers of Spark, this book will have you up and running in no time. You’ll learn how to express MapReduce jobs with just a few simple lines of Spark code, instead of spending extra time and effort working with Hadoop’s raw Java API.


Quickly dive into Spark capabilities such as collect, count, reduce, and save
Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
Learn how to run interactive, iterative, and incremental analyses
Integrate with Scala to manipulate distributed datasets like local collections
Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming

274 pages, Paperback

First published July 22, 2013

198 people are currently reading
740 people want to read

About the author

Holden Karau

14 books · 20 followers

Ratings & Reviews

Community Reviews

5 stars: 149 (26%)
4 stars: 255 (45%)
3 stars: 130 (22%)
2 stars: 24 (4%)
1 star: 8 (1%)
Displaying 1 - 30 of 55 reviews
Jacek Laskowski
12 reviews · 35 followers
September 9, 2015
Learning Spark from O'Reilly is a fun-Spark-tastic book! It has helped me to pull all the loose strings of knowledge about Spark together. The official documentation, articles, blog posts, the source code, and StackOverflow gave me a fine start, but it was the book that made it all flow well. I'm much better equipped to understand the concepts of Apache Spark - RDDs, DataFrames, DStreams, driver vs executors, clusters, dos and don'ts, monitoring, a little Machine Learning using MLlib, and much, much more. And there were only ca. 250 pages! That's far too few to cover Apache Spark in depth for sure, but the book did a great job of not being too lengthy (which could scare people into running away when they spot the book on a shelf in a bookstore) while covering enough details, with areas left for self-study when needed. The book is exactly what I can offer to anyone wishing to learn Spark and apply it with confidence to problems it was really meant to solve.

From a higher-level learning perspective, the book follows a proven teaching path - start with the theory, show a few examples and explain the more advanced topics just a little (to whet my appetite enormously, though). It met my expectations fully, but, as usually happens when I’m guided by very skilled teachers, my appetite grew so much that I hated it (and got mad) when the book finished.

The book is packed with plenty of examples, explanations, motivations, recommendations, and tips that, together with the writing flow, the layout, fonts, and such, made the book so pleasant to read. I’m into Apache Spark as a technology advocate for Apache Spark at deepsense.io, and the book has just made my wish to get deeper into Spark even stronger. The book targets developers (Scala, Java, Python), data scientists, and administrators (with a little of Spark's clustering, monitoring and tuning).

The book is a fine example of what sort of books resonate well with me.

I was reading the book using Kindle on a Nexus 7 tablet and it read very well. The examples were well laid out, the fonts appropriate and in general the quality was excellent. It helped me convince myself to read more books in electronic format.

I’d really love to have a series of follow-up books, each devoted specifically to one topic - clustering (standalone, YARN, and Mesos), streaming and integration with external sources (Kafka and Flume come to mind), DataFrames and SQL, Machine Learning (Pipelines and algorithms in MLlib) and graph processing (GraphX). I think these could easily each have 200+ pages.

On the flip side, I would advocate for improving Chapter 9, Spark SQL, as it was too shallow (even without a separate section on DataFrames) and there were very few examples. With the features of the upcoming Spark 1.5 included, the book could easily remain the Spark book for years to come. There’s no chapter about SparkR, either.

I think it’s going to be a tough exercise to beat Learning Spark content-wise! As a reader, however, I’d like to be wrong and encourage publishers to take up the challenge!
2 reviews
August 11, 2014
Warning: this is a review of an early release edition of the book, containing only about five of the planned thirteen chapters.

An amazing introduction to Spark, written with help from the creators of this distributed computing framework. Definitely much better than just reading the documentation, especially if you are interested in PySpark, the Python API to Spark, which is currently under-documented and pretty much hit-and-miss.
Alex Ott
Author · 3 books · 207 followers
October 9, 2020
Overview of the major functionality of Spark, sometimes quite shallow, but that's good for people who are just starting. A major advantage is that the book was updated for the Spark 3.0 release, which is quite new. Also, examples are in both Scala & Python.
One thing to take into account - there are bugs in some code examples, so don't be surprised if the code doesn't work out of the box.
Paul
338 reviews · 14 followers
March 13, 2021
This is a boring book about a boring library that does supremely important things. If I get into a situation where I need to use it, I'll finish it, but I've certainly gone far enough for now.
Vlad Ardelean
157 reviews · 34 followers
November 18, 2020
Good introduction to Spark.

I don't consider this review worthy of reading; feel free to skip it. This book is purely technical: you'll find in it exactly what you expect.

The book contains code samples in Java, Scala and Python.

It covered Spark 1.1 and 1.2 (and a few sections about 1.3). Other reviewers say this is quite outdated. The most current version at this time is 3.0.0.

Here's a breakdown of the topics discussed (which I'm sure can be found online as well):

* Basic Spark programming interface/concepts: RDDs, DataFrames, DStreams, common classes/methods; actions vs transformations
* Slightly more advanced concepts: accumulators and broadcast variables
* Spark execution model (jobs, stages, tasks)
* Various installation options (standalone mode, YARN, Mesos, launching Spark on EC2 in standalone mode)
* Performance tuning
* Debugging
* Visualization of the Spark installation status and execution stages
* Spark SQL / HiveQL
* Spark Streaming (with the Receiver programs)
* Introduction to MLlib (the machine learning library)

João
64 reviews
June 6, 2021
It's a nice book. I read it after about 6 months of light experience with Spark. It certainly helped me understand the fundamentals of how Spark works and what it does. The book gives a general overview of the commands available with a stronger focus on Spark SQL. Chapter 5 felt a bit all over the place. The chapter on optimization of Spark applications, although not very much in depth, gave me the basic tools to make small adjustments to the code with high impact, and also the knowledge to get better at debugging Spark code.

I enjoyed the introductions to MLlib, Structured Streaming, storage solutions, and MLflow. These are good topics to cover for a beginner like myself.

As other readers mention, it is a bit superficial, so I am not sure I will come back to it very often. But it is great to begin with, easy to read, and introduced me to a lot of things I want to explore in the future.
Gavin
Author · 2 books · 565 followers
July 17, 2018
Tool books are difficult to stomach: their contents are so much more ephemeral than other technical books. It's not worth it: in 10 years, will it matter? etc. (This is an incredibly high bar to pose, but that's how high my opinion is of the technical pursuits.) O'Reilly soften this blow, occasionally, by enlisting really brilliant authors who bring in the eternal and the broad while pootering around their narrow furrow. (I am incredibly fond of Alan Gates for this, for instance.)

Spark is the biggest deal by far in my corner of the world and will probably affect your life in minor ways you will never pin down (see O'Neil below).

[Theory #1, Thinking #1]
Nar Kumar Chhantyal
8 reviews · 3 followers
December 25, 2017
I think it's a really good book to get started with Spark. Since I already use Spark at work, I went through it quite fast, as I already knew a lot of what is in the book. I still learned many new things.

It has code samples in Java, Scala and Python. It covers Spark 1.x, while Spark 2.x is already out with quite a few changes. So you might have to adjust these samples.
1 review
June 26, 2017
The book is good as a starter kit but doesn't go very deep into Spark internals. The book is set up well, has a very smooth flow, and can be read in one or two sittings.
George
91 reviews · 3 followers
September 2, 2019
Although a bit dated now, it still contains a really smooth introduction to the main concepts and operations that will get you up to speed with Spark quite fast. I really liked the fact that all examples are in Python, Scala, and Java.
Kirill
5 reviews · 1 follower
July 4, 2017
A nice general overview of popular Spark components with examples in Scala, Java and Python. Python code does not respect PEP-8.
Kalyan Tirunahari
126 reviews
September 30, 2019
Very basic and simple examples to get into Spark. Targeted at newbies, and the coverage of tuning and performance is not in depth either. But overall a quick and dirty way to dig into Spark.
Isaac
71 reviews
May 20, 2020
Good, but like all tech books it's aging like cut lettuce in the sun.
Jijo
3 reviews · 2 followers
April 25, 2021
Good starting point for beginners and intermediate levels.
76 reviews
July 3, 2025
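About 3.5 stars. Simple and easy to follow, suitable for beginners. The code is implemented in three languages (Python, Scala, Java), and the coverage is fairly comprehensive, giving you a good overall grasp. It feels like Spark has just two main concepts: everything is an RDD, and the cluster has a master-worker architecture. PS: I read too slowly. I had planned to finish two books over the holiday, and so far I have only finished one 😭 https://book.douban.com/subject/26616...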
Erkin Unlu
174 reviews · 26 followers
March 22, 2017
I've learned much more from online MOOCs than from this book.
Jascha
151 reviews
December 24, 2017
Over the last few years Big Data has gathered an incredible amount of momentum. All this fuss and buzz has led top companies, as well as fearless start-ups, to invest hours and cash in data solutions, some of which have emerged, establishing new standards. Being in the spotlight has often resulted in these projects turning into open source ones. Among these is Spark, a cluster computing framework recently adopted by the Apache Foundation. Despite being a hot topic of 2015, the literature dedicated to the subject is still very limited. Among the few titles available, Learning Spark provides the curious reader with a decent overview of the major features provided by the framework.

Written by a group of enthusiasts and developers, including the original creator of the framework itself, Matei, Learning Spark targets data scientists and engineers. As expressly written on the back cover, this book is neither a reference nor a cookbook. Its goal is to present a different, faster alternative to Hadoop’s Map/Reduce paradigm and to the elephant made in Apache itself.

The reader is given a quick overview of the capabilities of the framework, such as the built-in libraries, Spark SQL and the many different data sources it can interact with. While not all the main features are presented, those that are found within these almost three hundred pages come with plenty of well-explained examples.

The examples are, on the other hand, one of the many perplexities raised by this text: each is presented in Python, Java and Scala. While it is great to see many different bindings in action, any averagely skilled Pythonist can easily understand what happens in Java, and vice versa. This is even more true in the case of Scala, another much-wanted topic of recent years, inevitably related to Java and its ecosystem.

Another thumbs down for the complete absence of anything related to Spark’s internal architecture. The car looks nice, but what about the engine? How does it work? Magic? Witchery?

Again, the examples presented are clear and well explained, but there is no real world case shown. Spark is meant to get executed on huge clusters with scary amounts of data. True, this is a quick overview of the product, but “hello world” per se does not make me wanna learn more.

Overall, a good read for that early morning hour of commute. It helps the curious reader to pick up the basics of the framework. On the other hand, nothing of what is presented can’t be found in the web pages of the Apache Software Foundation.
Todd N
357 reviews · 255 followers
April 8, 2015
Very good overview of Spark and guided tour through the APIs of its major components (GraphX being the notable exception).

I preordered this book and finally got a chance to read it over spring break. Something at this technical level is just what the Spark project has needed for a long time.

I’ve set up Spark as a YARN application for clients, but never really dug into how it worked or what it was capable of. And at work I’ve queried Spark SQL over JDBC, but without much understanding of what was really going on underneath.

Coming to this book with a fairly good understanding of Hadoop, I was struck by how simple and powerful the Spark API is. Also, I like how it covers in its components many of the things that Hadoop takes entire distributions of projects to do. (Though I guess there is no way to get around using ZooKeeper.)

It’s also smart the way that Spark is able to leverage HDFS and S3, which enterprises have been throwing their data in for at least half a decade now.

Spark is evolving quickly, so I’m not sure how well this book will age. (For example, SchemaRDDs have already been renamed to DataFrames.) But the fundamentals covered are sure to be solid for a while.

If you have a background in Hadoop, this will be an easy read. If not, maybe reading an overview of HDFS and Hive first would help (and maybe S3). Maybe working through a word count map-reduce tutorial would be useful too, just for background knowledge.

I loved all the Python API examples because that’s the only language that I truly feel comfortable in. Java was familiar again from Hadoop. I don’t know Scala, but it looks like Python and Java had a weird baby or something, so I’ll check it out.

The chapter on MLlib won’t make too much sense unless you understand what the algorithms mean or you are reading it with some sort of reference open. Still, the examples are clear enough to give a flavor. I didn’t get a sense of how to actually evaluate models, but I doubt anyone would use MLlib without using the real online documentation.

Highly recommended.
Jin Shusong
78 reviews · 1 follower
February 20, 2017
The book is good for Spark beginners. The author is one of the team members who wrote the Spark system, so the intuitions and concepts are definitely correct. This is very important for beginners.
61 reviews · 2 followers
July 31, 2018

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.


Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
Learn how to deploy interactive, batch, and streaming applications
Connect to data sources including HDFS, Hive, JSON, and S3
Master advanced topics like data partitioning and shared variables

Michael Koltsov
110 reviews · 69 followers
November 13, 2015
This particular book should be included if Spark eventually gets a nice and shiny boxed version with caps and T-shirts inside. What more can I say? This book is partly written by the creator of Spark himself, hence it should be treated as the comprehensive and succinct manual that Spark unfortunately doesn’t have as of today (for free).

Luckily, if you’ve spent a relatively small amount of money to buy/read/learn/try all examples you will know two times more than a typical Spark developer with 1 year of experience under his belt.

The only problem I had with this book is that some of the examples don’t work because the book itself hasn’t been updated to be on track with the most recent version of Spark (which differs a lot from the version described in the book).

Score 4/5
Nitish Sheoran
1 review
August 5, 2016
When I started this book, I was basically looking for a book which could give me a good introduction to Apache Spark and PySpark. This book was quite a good learning experience, with programming examples that put more emphasis on Java/Scala than Python. PySpark is still less developed than Scala/Java, but there are still places where more detailed Python examples could be included. So in conclusion, it is one of the best books on the market for introducing Apache Spark and learning Spark using Java/Scala, but it lags behind in its PySpark concepts.