Is HDFS necessary for Spark workloads?

HDFS is not strictly necessary for Spark, but it is recommended in some places.

To help evaluate the effort spent in getting HDFS running:

What are the benefits of using HDFS for Spark workloads?

Spark is a distributed processing engine and HDFS is a distributed storage system.

If HDFS is not an option, then Spark has to use some other storage layer, such as Apache Cassandra or Amazon S3.

Have a look at this comparison:

S3 – Non-urgent batch jobs. S3 fits very specific use cases where data locality isn’t critical.

Cassandra – Perfect for streaming data analysis, but overkill for batch jobs.

HDFS – Great fit for batch jobs without compromising on data locality.

When should you use HDFS as the storage engine for Spark distributed processing?

If you already have a big Hadoop cluster in place and are looking for real-time analytics of your data, Spark can use the existing Hadoop cluster. This will reduce development time.
Spark is an in-memory computing engine. Since data can't always fit into memory, it has to be spilled to disk for some operations. Spark will benefit from HDFS in this case. The sort benchmark record achieved by Spark (the 2014 Daytona GraySort) used HDFS storage for the sorting operation.
HDFS is a scalable, reliable and fault-tolerant distributed file system (since the Hadoop 2.x release). With the data locality principle, processing speed is improved.
HDFS is the best fit for batch-processing jobs.

arj

The shortest answer is: "No, you don't need it". You can analyse data even without HDFS, but of course you then need to replicate the data on all your nodes.

The long answer is quite counterintuitive, and I'm still trying to understand it with the help of the Stack Overflow community.

Spark local vs HDFS performance

HDFS (or any distributed filesystem) makes distributing your data much simpler. With a local filesystem you would have to partition and copy the data by hand to the individual nodes and be aware of the data distribution when running your jobs. In addition, HDFS handles node failures. As for the integration between Spark and HDFS, you can imagine Spark knowing about the data distribution, so it will try to schedule tasks on the same nodes where the required data resides.
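The scheduling preference described above can be illustrated with a toy sketch. This is not Spark's scheduler: the function `assign_tasks`, its arguments, and the host names are hypothetical, and the labels only borrow the names of two of Spark's locality levels. It assumes one task per block and shows nothing more than the idea of preferring a node that already stores the block.

```python
from typing import Dict, List, Tuple


def assign_tasks(
    block_locations: Dict[str, List[str]],
    free_slots: Dict[str, int],
) -> Dict[str, Tuple[str, str]]:
    """Assign one task per block, preferring a node that stores the block.

    Returns a mapping: block -> (chosen node, "NODE_LOCAL" or "ANY").
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment: Dict[str, Tuple[str, str]] = {}
    for block, replicas in block_locations.items():
        # First choice: a replica holder with a free slot (no network read).
        local = next((n for n in replicas if slots.get(n, 0) > 0), None)
        if local is not None:
            node, level = local, "NODE_LOCAL"
        else:
            # Fallback: any node with a free slot; the block is read remotely.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free slots left")
            level = "ANY"
        slots[node] -= 1
        assignment[block] = (node, level)
    return assignment


if __name__ == "__main__":
    # Hypothetical block placement: every block has a replica on node-a.
    blocks = {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-a", "node-c"],
        "blk_3": ["node-a"],
    }
    print(assign_tasks(blocks, {"node-a": 1, "node-b": 1, "node-c": 1}))
```

With a local filesystem and hand-copied data there is no such placement map to consult, which is why the distribution has to be tracked manually in that setup.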

Second: which problems exactly did you face with the instructions?

BTW: if you are just looking for an easy setup on AWS, DCOS allows you to install HDFS with a single command...

So you could go with the Cloudera or Hortonworks distro and load up an entire stack very easily. CDH will be used with YARN, though; I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.

HDFS is great because its datanodes give you data locality (you process the data where it is stored), and shuffling or transferring data is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on those blocks (128 MB blocks by default; you can change this).
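As a rough illustration of the block arithmetic, the sketch below splits a file size into 128 MB block ranges. The helper `block_ranges` is hypothetical and only does the arithmetic; the number of partitions Spark actually creates depends on the input format and its settings, so treat the one-partition-per-block mapping as a simplifying assumption.

```python
from typing import List, Tuple

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size mentioned above


def block_ranges(
    file_size: int, block_size: int = DEFAULT_BLOCK_SIZE
) -> List[Tuple[int, int]]:
    """Return one (offset, length) pair per block of a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    one_gib = 1024 * 1024 * 1024
    # Under the one-partition-per-block assumption, this is the partition count.
    print(len(block_ranges(one_gib)))
```

A file that is not a multiple of the block size simply ends with a shorter last block, so a 300 MB file yields two full blocks and one 44 MB block.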

You could use S3 and Redshift.

See here: https://github.com/databricks/spark-redshift

Source: https://stackoverflow.com/questions/32669187/is-hdfs-necessary-for-spark-workloads

Tags

Hadoop

apache-spark

HDFS

mesos

mesosphere