How to get csv on s3 with pyspark (No FileSystem for scheme: s3n)

Submitted by 大憨熊 on 2019-12-22 01:30:05

Question


There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.

Trying to load a simple test csv file from my s3.

Doing it locally, like below, works.

from pyspark.sql import SparkSession
from pyspark import SparkContext as sc

logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

But if I add this below:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()

I get:

No FileSystem for scheme: s3n

I've also tried changing sc to spark.sparkContext, without any difference.

I've also tried swapping // and /// in the URL.

Even better, I'd rather do this and go straight to a DataFrame:

dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")

Also I am slightly AWS ignorant, so I have tried s3, s3n, and s3a to no avail.

I've been around the internet and back but can't seem to resolve the scheme error. Thanks!


Answer 1:


I think your Spark environment doesn't have the AWS jars. You need to add them in order to use s3 or s3n.

You have to copy the required jar files from a Hadoop download into the $SPARK_HOME/jars directory; using the --jars flag or the --packages flag with spark-submit didn't work for me.

In my case the versions are Spark 2.3.0 and Hadoop 2.7.6, so you have to copy the following jars from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:

aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
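
Once those jars are in place, a minimal sketch along these lines (assuming the s3a scheme provided by hadoop-aws, and the placeholder credentials and bucket name from the question) should read the CSV straight into a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# configure the S3A filesystem (provided by hadoop-aws) with your credentials
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "foo")
hadoopConf.set("fs.s3a.secret.key", "bar")

# note: two slashes, then the bucket name
dataFrame = spark.read.csv("s3a://mybucket-sparkexample/sparkexamplefile.csv")
dataFrame.show()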



Answer 2:


You must check which hadoop-* jar files are bound to the specific version of pyspark installed on your system: look in the pyspark/jars folder for files named hadoop-*.
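
For example, a quick way to list them (a small sketch that only assumes a standard pip-installed pyspark layout):

import glob
import os
import pyspark

# the jars directory that ships with the installed pyspark package
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")

# e.g. hadoop-common-2.7.3.jar tells you to ask for hadoop-aws:2.7.3
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))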

Then pass the version you observe there into your pyspark script like this:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
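
One caveat worth spelling out: in a script, this environment variable has to be set before the SparkSession (and its SparkContext) is created, otherwise the extra packages are never picked up. A sketch, assuming the versions above match your pyspark's bundled hadoop jars:

import os

# must run before the SparkSession is created
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,'
    'org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()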

This is a bit tricky for newcomers to pyspark (I ran into it on my very first day with pyspark :-)).

For what it's worth, I am on a Gentoo system with a local Spark 2.4.2. Some suggested also installing Hadoop and copying the jars directly into Spark; they should still be the same version that PySpark is using. So I am creating a Gentoo ebuild for these versions...



Source: https://stackoverflow.com/questions/54358250/how-to-get-csv-on-s3-with-pyspark-no-filesystem-for-scheme-s3n
