Is HDFS necessary for Spark workloads?

夕颜 · 2021-01-06 05:39

HDFS is not strictly necessary for Spark, but it is recommended in some places.

To help evaluate whether the effort of getting HDFS running is worth it:

What are the benefits of using HDFS for Spark workloads?

4 Answers
  •  半阙折子戏
    2021-01-06 06:36

    You could go with a Cloudera or Hortonworks distribution and load up an entire stack very easily. CDH is used with YARN, though I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.

    HDFS is great because its DataNodes give you data locality (process the data where it lives), and shuffling/data transfer across the network is very expensive. HDFS also naturally splits files into blocks, which lets Spark create a partition per block (128 MB blocks by default; you can change this).
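    For a sense of how block size maps to Spark partitions, here is a minimal sketch; the SparkSession setup and the hdfs:///data/events.log path are placeholders, not part of the original answer:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("hdfs-locality-demo").getOrCreate()

        // textFile builds one partition per Hadoop input split, which by default
        // corresponds to one HDFS block (128 MB), so a ~1 GB file yields ~8 partitions.
        val rdd = spark.sparkContext.textFile("hdfs:///data/events.log")
        println(s"partitions: ${rdd.getNumPartitions}")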

    Alternatively, you could use S3 and Redshift instead of HDFS.

    See here: https://github.com/databricks/spark-redshift
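    As a rough sketch of what reading from Redshift through that connector looks like (the JDBC URL, table name, and S3 tempdir below are placeholder values; the connector stages data through S3, so AWS credentials must be available to Spark, e.g. via an IAM role or Hadoop config, and the s3a:// scheme assumes the Hadoop S3A filesystem is configured):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("redshift-read-demo").getOrCreate()

        // spark-redshift unloads the table to the S3 tempdir and reads it back as a DataFrame.
        val df = spark.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://redshift-host:5439/mydb?user=me&password=secret")
          .option("dbtable", "my_table")
          .option("tempdir", "s3a://my-bucket/spark-tmp")
          .load()

        df.printSchema()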
