I am trying to checkpoint the RDD to a non-HDFS system. From the DSE documentation it seems that it is not possible to use the Cassandra file system, so I am planning to use Amazon S3.
From the answer in the link:
Solution 1:
Export the AWS credentials as environment variables:

export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

Then set the checkpoint directory to an S3 URL, e.g. s3n://spark-streaming/checkpoint:

ssc.checkpoint("s3n://spark-streaming/checkpoint")
Then launch your Spark application using spark-submit.
This works in Spark 1.4.2.
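Putting Solution 1 together, a minimal sketch (the application name and batch interval below are assumptions for illustration; the bucket/prefix is the example from above, and the credentials are expected to come from the environment variables exported before spark-submit):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are read from the environment
// set before spark-submit, as described above.
val conf = new SparkConf().setAppName("s3-checkpoint-example") // hypothetical app name
val ssc = new StreamingContext(conf, Seconds(10))              // hypothetical batch interval

// Checkpoint straight to the S3 URL (example bucket/prefix from above).
ssc.checkpoint("s3n://spark-streaming/checkpoint")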
Solution 2:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.streaming.StreamingContext

// Set the S3 credentials on a Hadoop configuration and pass it to
// getOrCreate so the checkpoint data can be read back on restart.
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

// checkPointDir and config are assumed to be defined elsewhere.
val ssc = StreamingContext.getOrCreate(checkPointDir, () => {
  createStreamingContext(checkPointDir, config)
}, hadoopConf)
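The snippet assumes a createStreamingContext factory that is not shown. A minimal sketch of what it might look like (the config type and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical factory: builds a fresh context and registers the checkpoint
// directory, so getOrCreate can recover state from it after a failure.
def createStreamingContext(checkPointDir: String, config: SparkConf): StreamingContext = {
  val ssc = new StreamingContext(config, Seconds(10)) // batch interval is an assumption
  ssc.checkpoint(checkPointDir)
  ssc // getOrCreate calls this only when no checkpoint exists yet
}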
To checkpoint to S3, you can also pass the following notation to the StreamingContext method def checkpoint(directory: String): Unit:
s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>
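For example (placeholders as above):

ssc.checkpoint("s3n://<aws-access-key>:<aws-secret-key>@<s3-bucket>/<prefix ...>")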
Another reliable file system not listed in the Spark documentation for checkpointing is Tachyon (since renamed Alluxio).
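Checkpointing to Tachyon would look similar, by pointing the checkpoint directory at a tachyon:// URI. A sketch, assuming Tachyon's Hadoop-compatible client (tachyon.hadoop.TFS) is on the classpath; the master host, port, and path are assumptions:

// Register Tachyon's Hadoop-compatible FileSystem on the underlying SparkContext.
ssc.sparkContext.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

// Point the checkpoint directory at the Tachyon master (hypothetical host/path).
ssc.checkpoint("tachyon://tachyon-master:19998/spark-streaming/checkpoint")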