How to read Parquet data from S3 into a Spark DataFrame in Python?

Anonymous (unverified), submitted 2019-12-03 01:20:02

Question:

I am new to Spark and I have not been able to figure this out. I have a lot of Parquet files uploaded to S3 at this location:

s3://a-dps/d-l/sco/alpha/20160930/parquet/ 

The total size of this folder is 20+ GB. How can I chunk this and read it into a DataFrame? How do I load all these files into a DataFrame?

The memory allocated to the Spark cluster is 6 GB.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas

    # SparkConf().set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error:

    Py4JJavaError: An error occurred while calling o33.parquet.
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)

Answer 1:

The file scheme (s3) that you are using is not correct. You'll need to use the s3n scheme, or s3a (for larger S3 objects):

// use sqlContext instead for spark < 2
val df = spark.read
              .load("s3n://bucket-name/object-path")

I suggest that you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.
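For the PySpark setup in the question, a minimal sketch along these lines should work. It assumes a hadoop-aws artifact matching your Hadoop version is available (2.7.3 below is only an example), and the credentials and path are placeholders:

    from pyspark.sql import SparkSession

    # Ship the S3A filesystem classes with the job; pick the hadoop-aws version
    # that matches your Hadoop build (2.7.3 here is an assumption, not a requirement).
    spark = (SparkSession.builder
             .appName("read-parquet-from-s3")
             .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
             .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
             .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder
             .getOrCreate())

    # Note the s3a:// scheme instead of s3://.
    df = spark.read.parquet("s3a://a-dps/d-l/sco/alpha/20160930/parquet/")
    df.printSchema()

Spark reads the Parquet files partition by partition, so the 20+ GB folder does not have to fit into the cluster's 6 GB of memory at once unless you cache or collect the whole DataFrame.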



Answer 2:

Since Spark 2.0 you have to use SparkSession instead of SQLContext:

spark = SparkSession.builder \
                    .master("local") \
                    .appName("app name") \
                    .config("spark.some.config.option", True) \
                    .getOrCreate()

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
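A short usage note, with the path as a placeholder: pointing read.parquet at the directory instead of a single file picks up every part file underneath it, and you can inspect how Spark has split the data before pulling anything to the driver:

    # Point at the folder of Parquet part files; Spark discovers all of them.
    df = spark.read.parquet("s3a://path/to/parquet/")
    print(df.rdd.getNumPartitions())  # how many chunks Spark will process in parallel
    df.limit(10).show()               # look at a few rows without loading 20 GB to the driver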

