reading google bucket data in spark


Question


I have followed this blog post to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector. It worked fine. The following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results. But when I try to read data with PySpark using

rdd = sc.textFile("gs://crawl_tld_bucket/"),

it throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem for scheme: gs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)

How can I get this working?


Answer 1:


To access Google Cloud Storage you have to include the Cloud Storage connector jar when launching Spark:

spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py

or

pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
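If you would rather wire the connector up inside the script itself, roughly the same effect can be achieved by setting the configuration on the SparkConf before the context is created. This is a minimal sketch, assuming the jar path from the answer above and the bucket name from the question; the two filesystem class names are the ones shipped with the GCS connector, so check them against your connector version:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Ship the GCS connector jar with the application (path assumed, adjust to your install)
conf.set("spark.jars", "/path/to/gcs/gcs-connector-latest-hadoop2.jar")
# Register the "gs" scheme so Hadoop's FileSystem can resolve it
conf.set("spark.hadoop.fs.gs.impl",
         "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("spark.hadoop.fs.AbstractFileSystem.gs.impl",
         "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

sc = SparkContext(conf=conf)
rdd = sc.textFile("gs://crawl_tld_bucket/")  # bucket name taken from the question

Either way, the point is the same: the "No FileSystem for scheme: gs" error means the connector classes are not on Spark's classpath, so the gs:// scheme cannot be resolved.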


Source: https://stackoverflow.com/questions/46176907/reading-google-bucket-data-in-spark
