Learning Spark: Lightning-Fast Big Data Analysis

Name: Learning Spark: Lightning-Fast Big Data Analysis
Rating: 3.91 (55 reviews)
ISBN: 9781449358624

Holden Karau, Andy Konwinski, Patrick Wendell

The Web is getting faster, and the data it delivers is getting bigger. How can you handle everything efficiently? This book introduces Spark, an open source cluster computing system that makes data analytics fast to run and fast to write. You’ll learn how to run programs faster, using primitives for in-memory cluster computing. With Spark, your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

Written by the developers of Spark, this book will have you up and running in no time. You’ll learn how to express MapReduce jobs with just a few simple lines of Spark code, instead of spending extra time and effort working with Hadoop’s raw Java API.

Quickly dive into Spark capabilities such as collect, count, reduce, and save
Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
Learn how to run interactive, iterative, and incremental analyses
Integrate with Scala to manipulate distributed datasets like local collections
Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming

Genres: Programming, Technology, Computer Science, Technical, Nonfiction, Software, Computers

274 pages, Paperback

First published July 22, 2013

198 people are currently reading

740 people want to read

About the author

Holden Karau

14 books, 20 followers

Community Reviews

5 stars: 149 (26%)
4 stars: 255 (45%)
3 stars: 130 (22%)
2 stars: 24 (4%)
1 star: 8 (1%)

Displaying 1 - 30 of 55 reviews

Alex Ott

Author, 3 books, 207 followers

April 1, 2015

A quite good introduction to Spark: it covers all components and is not too outdated, since the book covers 1.1 plus parts of 1.2.

big-data ir-dm-nlp-ml-search

Jacek Laskowski

12 reviews, 35 followers

September 9, 2015

Learning Spark from O'Reilly is a fun-Spark-tastic book! It has helped me pull together all the loose strings of my knowledge about Spark. The official documentation, articles, blog posts, the source code, and StackOverflow gave me a fine start, but it was the book that made it all flow well. I'm much better equipped to understand the concepts of Apache Spark: RDDs, DataFrames, DStreams, driver vs executors, clusters, dos and don'ts, monitoring, a little Machine Learning using MLlib, and much, much more. And there were only ca. 250 pages! That is far too few to cover Apache Spark in depth, for sure, but the book did a great job of not being too lengthy (so that people don't get scared and run away when they spot it on a shelf in a bookstore) while still covering enough detail, with areas left for self-study when needed. The book is exactly what I can offer to anyone wishing to learn Spark and apply it with confidence to the problems it was really meant to solve.

From a higher-level learning perspective the book follows a proven teaching path - start with the theory, show a few examples and explain the more advanced topics just a little (to whet my appetite enormously, though). It met my expectations fully, but, as it usually happens when I’m guided by very skilled teachers, my appetite grew so badly that I hated (and got mad) when the book finished.

The book is packed with plenty of examples, explanations, motivations, recommendations, and tips that, together with the writing flow, the layout, fonts, and such, made the book so pleasant to read. I’m into Apache Spark as a technology advocate for it at deepsense.io, and the book has just made my wish to get deeper into Spark even stronger. The book targets developers (Scala, Java, Python), data scientists, and administrators (with a little on Spark's clustering, monitoring and tuning).

The book is a fine example of what sort of books resonate well with me.

I was reading the book using Kindle on a Nexus 7 tablet and it read very well. The examples were well laid out, the fonts appropriate and in general the quality was excellent. It helped me convince myself to read more books in electronic format.

I’d really love to have a series of follow-up books, each devoted to a single topic: clustering (standalone, YARN, and Mesos), streaming and integration with external sources (Kafka and Flume come to mind), DataFrames and SQL, Machine Learning (Pipelines and algorithms in MLlib) and graph processing (GraphX). I think each of these could easily run to 200+ pages.

On the flip side, I would advocate for improving Chapter 9, Spark SQL, as it was too shallow (even without a separate section on DataFrames) and had very few examples. With the features of the upcoming Spark 1.5 included, the book could easily remain the Spark book for years to come. There’s no chapter about SparkR, either.

I think it’s going to be a tough exercise to beat Learning Spark content-wise! As a reader, however, I’d like to be wrong and encourage publishers to take up the challenge!

apache-spark

Alessandro Andrioni

2 reviews

August 11, 2014

Warning: this is a review of an early release edition of the book, containing only about five of the planned thirteen chapters.

An amazing introduction to Spark, written with help from the creators of this distributed computing framework. Definitely much better than just reading the documentation, especially if you are interested in PySpark, the Python API to Spark, which is currently under-documented and pretty much hit-and-miss.

ebooks

Alex Ott

Author, 3 books, 207 followers

October 9, 2020

An overview of Spark's major functionality, sometimes quite shallow, but that's good for people who are just starting. A major advantage is that the book was updated for the Spark 3.0 release, which is quite new. Also, the examples are in both Scala and Python.
One thing to take into account: there are bugs in some code examples, so don't be surprised if the code doesn't work out of the box.

big-data

Paul

338 reviews, 14 followers

March 13, 2021

This is a boring book about a boring library that does supremely important things. If I get into a situation where I need to use it, I'll finish it, but I've certainly gone far enough for now.

Vlad Ardelean

157 reviews, 34 followers

November 18, 2020

Good introduction to Spark.

I don't consider this review worth reading, so feel free to skip it. This book is purely technical; you'll find in it exactly what you expect.

The book contains code samples in Java, Scala and Python.

It covers Spark 1.1 and 1.2 (and a few sections cover 1.3). Other reviewers say this is quite outdated. The most current version at this time is 3.0.0.

Here's a breakdown of the topics discussed (which I'm sure can be found online as well):

* Basic Spark programming interface/concepts: RDDs, DataFrames, DStreams, common classes/methods; actions vs transformations
* Slightly more advanced concepts: accumulators and broadcast variables
* Spark execution model (jobs, stages, tasks)
* Various installation options (standalone mode, YARN, Mesos, launching Spark on EC2 in standalone mode)
* Performance tuning
* Debugging
* Visualization of the Spark installation status and execution stages
* Spark SQL / HiveQL
* Spark Streaming (with the Receiver programs)
* Introduction to MLlib (the machine learning library)

João

64 reviews

June 6, 2021

It's a nice book. I read it after about 6 months of light experience with Spark. It certainly helped me understand the fundamentals of how Spark works and what it does. The book gives a general overview of the commands available with a stronger focus on Spark SQL. Chapter 5 felt a bit all over the place. The chapter on optimization of Spark applications, although not very much in depth, gave me the basic tools to make small adjustments to the code with high impact, and also the knowledge to get better at debugging Spark code.

I enjoyed the introductions to MLlib, Structured Streaming, storage solutions, and MLflow. These are good topics to cover for a beginner like myself.

Like other readers mention, it is a bit superficial, so I am not sure I will come back to it very often. But it is great to begin with, easy to read, and introduced me to a lot of things I want to explore in the future.

Gavin

Author, 2 books, 565 followers

July 17, 2018

Tool books are difficult to stomach: their contents are so much more ephemeral than other technical books. It's not worth it: in 10 years, will it matter? etc. (This is an incredibly high bar to pose, but that's how high my opinion is of the technical pursuits.) O'Reilly soften this blow, occasionally, by enlisting really brilliant authors who bring in the eternal and the broad while pootering around their narrow furrow. (I am incredibly fond of Alan Gates for this, for instance.)

Spark is the biggest deal by far in my corner of the world and will probably affect your life in minor ways you will never pin down (see O'Neil below).

[Theory #1, Thinking #1]

Nar Kumar Chhantyal

8 reviews, 3 followers

December 25, 2017

I think it's a really good book to get started with Spark. Since I already use Spark at work, I went through it quite fast, as I already knew a lot of what is in the book. I still learned many new things.

It has code samples in Java, Scala and Python. It covers Spark 1.x, while Spark 2.x is already out with quite a few changes, so you might have to adjust these samples.

Siddhant

1 review

June 26, 2017

The book is good as a starter kit but doesn't go too deep into Spark internals. The book is set up well, has a very smooth flow, and can be read in one or two sittings.

George

91 reviews, 3 followers

September 2, 2019

Although a bit dated now, it still contains a really smooth introduction to the main concepts and operations that will get you up to speed with Spark quite fast. I really liked the fact that all examples are in Python, Scala, and Java.

Michael Koltsov

110 reviews, 69 followers

May 28, 2017

The book gives only a shallow knowledge of Spark

Kirill

5 reviews, 1 follower

July 4, 2017

A nice general overview of popular Spark components with examples in Scala, Java and Python. Python code does not respect PEP-8.

Pradeep

4 reviews

September 10, 2017

Great introduction to Spark.

4 reviews, 2 followers

Great book on Spark.

126 reviews

Very basic and simple examples to get into Spark. It is targeted at newbies, and the coverage of tuning and performance is not in depth either. But overall it is a quick and dirty way to dig into Spark.

tech

Isaac

71 reviews

May 20, 2020

Good, but like all tech books it's aging like cut lettuce in the sun.

Lincoln Karuhanga

21 reviews

February 9, 2021

Great introduction.
Could have gone a little more into deployment and cluster management.

Jijo

3 reviews, 2 followers

April 25, 2021

Good starting point for beginners and intermediate levels.

Nicklas Bekkevold

34 reviews, 1 follower

July 10, 2023

I did not like the formatting

5 reviews

But dry

76 reviews

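About 3.5 stars. Simple and easy to understand, suitable for beginners. The code in the book is implemented in three languages (Python, Scala, Java), and the coverage is fairly comprehensive, giving the reader an overall grasp of the subject. It feels like Spark has just two main concepts: everything is an RDD, and the cluster uses a master-worker architecture. PS: I read too slowly. I had planned to finish two books over the holiday, and so far I have only finished one 😭 https://book.douban.com/subject/26616...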

Erkin Unlu

174 reviews, 26 followers

March 22, 2017

I've learned much more from online MOOCs than from this book.

Jacek

23 reviews, 2 followers

October 15, 2019

Polish translation is unbelievably bad

my-data-science-challenge

Jascha

151 reviews

December 24, 2017

Over the last few years Big Data has gathered an incredible amount of momentum. All this fuss and buzz has led top companies, as well as fearless start-ups, to invest hours and cash in data solutions, some of which have emerged to establish new standards. Being in the spotlight has often resulted in these projects turning into open source ones. Among these is Spark, a cluster computing framework recently adopted by the Apache Foundation. Despite Spark being a hot topic of 2015, the literature dedicated to the subject is still very limited. Among the few titles available, Learning Spark provides the curious reader with a decent overview of the major features provided by the framework.

Written by a group of enthusiasts and developers, including the original creator of the framework itself, Matei, Learning Spark targets data scientists and engineers. As expressly written on the back cover, this book is neither a reference nor a cookbook. Its goal is to present a different, faster alternative to Hadoop’s Map/Reduce paradigm and to the elephant made in Apache itself.

The reader is given a quick overview of the capabilities of the framework, such as the built-in libraries, Spark SQL and the many different data sources it can interact with. While not all the main features are presented, those that are found within these almost three hundred pages come with plenty of well-explained examples.

The examples are, on the other hand, one of the many perplexities raised by this text: each is presented in Python, Java and Scala. While it is great to see many different bindings in action, any reasonably skilled Pythonist can easily understand what happens in Java, and vice versa. This is even more true in the case of Scala, another most-wanted topic of recent years, inevitably related to Java and its ecosystem.

Another thumbs down for the complete absence of anything related to Spark’s internal architecture. The car looks nice, but what about the engine? How does it work? Magic? Witchery?

Again, the examples presented are clear and well explained, but there is no real world case shown. Spark is meant to get executed on huge clusters with scary amounts of data. True, this is a quick overview of the product, but “hello world” per se does not make me wanna learn more.

Overall, a good read for that early morning hour of commute. It helps the curious reader to pick up the basics of the framework. On the other hand, nothing presented here can’t be found on the web pages of the Apache Software Foundation.

bigdata bigdata-spark data-analysis

Todd N

357 reviews, 255 followers

April 8, 2015

Very good overview of Spark and guided tour through the APIs of its major components (GraphX being the notable exception).

I preordered this book and finally got a chance to read it over spring break. Something at this technical level is just what the Spark project has needed for a long time.

I’ve set up Spark as a YARN application for clients, but never really dug into how it worked or what it was capable of. And at work I’ve queried Spark SQL over JDBC, but without much understanding of what was really going on underneath.

Coming to this book with a fairly good understanding of Hadoop, I was struck by how simple and powerful the Spark API is. Also, I like how it covers in its components many of the things that Hadoop takes entire distributions of projects to do. (Though I guess there is no way to get around using ZooKeeper.)

It’s also smart the way that Spark is able to leverage HDFS and S3, which enterprises have been throwing their data in for at least half a decade now.

Spark is evolving quickly, so I’m not sure how well this book will age. (For example, SchemaRDDs have already been renamed DataFrames.) But the fundamentals covered are sure to be solid for a while.

If you have a background in Hadoop, this will be an easy read. If not, maybe reading an overview of HDFS and Hive first would help (and maybe S3). Maybe working through a word count map-reduce tutorial would be useful too, just for background knowledge.

I loved all the Python API examples because that’s the only language that I truly feel comfortable in. Java was familiar again from Hadoop. I don’t know Scala, but it looks like Python and Java had a weird baby or something, so I’ll check it out.

The chapter on MLlib won’t make too much sense unless you understand what the algorithms mean or you are reading it with some sort of reference open. Still, the examples are clear enough to give a flavor. I didn’t get a sense of how to actually evaluate models, but I doubt anyone would use MLlib without using the real online documentation.

Highly recommended.

big-data

Jin Shusong

78 reviews, 1 follower

February 20, 2017

The book is good for Spark beginners. The author is one of the team members who wrote the Spark system, so the intuitions and concepts are definitely correct. This is very important for beginners.

Joel

61 reviews, 2 followers

July 31, 2018

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.

Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.

Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
Learn how to deploy interactive, batch, and streaming applications
Connect to data sources including HDFS, Hive, JSON, and S3
Master advanced topics like data partitioning and shared variables

Michael Koltsov

110 reviews, 69 followers

November 13, 2015

This particular book should be included if Spark ever gets a nice, shiny boxed version with caps and T-shirts inside. What more can I say? This book is partly written by the creator of Spark himself, hence it should be treated as the comprehensive and succinct manual that Spark unfortunately doesn’t have as of today (for free).

Luckily, if you’ve spent a relatively small amount of money to buy/read/learn/try all examples you will know two times more than a typical Spark developer with 1 year of experience under his belt.

The only problem I had with this book is that some of the examples don’t work because the book itself hasn’t been updated to be on track with the most recent version of Spark (which differs a lot from the version described in the book).

Score 4/5

Nitish Sheoran

1 review

August 5, 2016

When I started this book, I was basically looking for a book that could give me a good introduction to Apache Spark and PySpark. This book was quite a good learning experience, with programming examples that put more emphasis on Java/Scala than on Python. PySpark is still less developed than Scala/Java, but there are still places where more detailed Python examples could be included. So, in conclusion, it is one of the best books on the market for introducing Apache Spark and learning Spark using Java/Scala, but it lags behind in its PySpark coverage.
