HDFS is not strictly necessary, but it is recommended in some places.
To help evaluate the effort spent in getting HDFS running:
What are the benefits of
So you could go with the Cloudera or Hortonworks distro and load up an entire stack very easily. CDH is normally run with YARN, though, and I find it much more difficult to configure Mesos in CDH. Hortonworks is much easier to customize.
HDFS is great because DataNodes give you data locality (processing happens where the data is), and shuffling or transferring data is very expensive. HDFS also naturally splits files into blocks, which allows Spark to partition on those blocks (128 MB by default; you can change this).
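To make the block-to-partition point concrete, here is a rough back-of-the-envelope sketch in plain Python. It assumes one input partition per HDFS block and the 128 MB default block size; the function name `estimated_partitions` is made up for illustration, and the real number Spark picks also depends on the input format and on settings such as the minimum partition count.

```python
import math

# Default HDFS block size in MB (the dfs.blocksize setting); configurable per cluster.
BLOCK_SIZE_MB = 128


def estimated_partitions(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Estimate how many blocks a file occupies in HDFS.

    Assumption: Spark creates roughly one input partition per block.
    """
    if file_size_mb <= 0:
        return 0
    # A partially filled last block still counts as a block.
    return math.ceil(file_size_mb / block_size_mb)


for size_mb in (100, 1024, 10 * 1024):
    print(size_mb, "MB ->", estimated_partitions(size_mb), "partitions")
```

So a 1 GB file would come out at about 8 partitions with the default block size, and about 4 if you raise the block size to 256 MB.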
You could use S3 and Redshift.
See here: https://github.com/databricks/spark-redshift