Is HDFS necessary for Spark workloads?

夕颜 · 2021-01-06 05:39

HDFS is not strictly necessary for Spark, but it is recommended in some places.

To help evaluate whether the effort of getting HDFS running is worth it:

What are the benefits of using HDFS for Spark workloads?

4 Answers
  •  半阙折子戏
    2021-01-06 06:36

    You could go with a Cloudera or Hortonworks distribution and load up an entire stack very easily. CDH is used with YARN, though I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.

    HDFS is great because its DataNodes give you data locality (process the data where it lives), and shuffling/data transfer across the network is very expensive. HDFS also naturally splits files into blocks, which lets Spark create a partition per block (128 MB blocks by default; you can change this).
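    For a sense of how block size maps to Spark partitions, here is a minimal sketch; the SparkSession setup and the hdfs:///data/events.log path are placeholders, not part of the original answer:

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("hdfs-locality-demo").getOrCreate()

        // textFile builds one partition per Hadoop input split, which by default
        // corresponds to one HDFS block (128 MB), so a ~1 GB file yields ~8 partitions.
        val rdd = spark.sparkContext.textFile("hdfs:///data/events.log")
        println(s"partitions: ${rdd.getNumPartitions}")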

    Alternatively, you could use S3 and Redshift instead of HDFS.

    See here: https://github.com/databricks/spark-redshift
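    As a rough sketch of what reading from Redshift through that connector looks like (the JDBC URL, table name, and S3 tempdir below are placeholder values; the connector stages data through S3, so AWS credentials must be available to Spark, e.g. via an IAM role or Hadoop config, and the s3a:// scheme assumes the Hadoop S3A filesystem is configured):

        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("redshift-read-demo").getOrCreate()

        // spark-redshift unloads the table to the S3 tempdir and reads it back as a DataFrame.
        val df = spark.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://redshift-host:5439/mydb?user=me&password=secret")
          .option("dbtable", "my_table")
          .option("tempdir", "s3a://my-bucket/spark-tmp")
          .load()

        df.printSchema()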
