I am using Apache Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node, I have a directory data
One quick suggestion is to load the CSV from S3 instead of keeping it on local storage.
Here is a sample Scala snippet that loads a CSV file from an S3 bucket:
// S3 path with the AWS credentials embedded in the URL
val csvS3Path = "s3n://REPLACE_WITH_YOUR_ACCESS_KEY:REPLACE_WITH_YOUR_SECRET_KEY@REPLACE_WITH_YOUR_S3_BUCKET"

// Read the CSV with the spark-csv package, treating the first row as the header
val dataframe = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load(csvS3Path)
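
If you would rather not embed the credentials in the URL (a secret key containing a / character breaks that form), you can set them on the Hadoop configuration instead. A minimal sketch, assuming sc is your SparkContext:

// Supply AWS credentials via the Hadoop configuration rather than the URL
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "REPLACE_WITH_YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "REPLACE_WITH_YOUR_SECRET_KEY")

// The S3 path no longer needs the key pair embedded in it
val dataframe = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("s3n://REPLACE_WITH_YOUR_S3_BUCKET")

You can then sanity-check the load with dataframe.printSchema() and dataframe.show().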