pyspark error: java.io.IOException: No FileSystem for scheme: gs

野性不改 2020-12-21 16:50

I am trying to read a json file from a google bucket into a pyspark dataframe on a local spark machine. Here's the code:

import pandas as pd
import numpy as np


        
1 Answer
  • 2020-12-21 17:23

    Some config params are required for Spark to recognize "gs" as a distributed filesystem.

    Point Spark at the Google Cloud Storage connector jar, gcs-connector-hadoop2-latest.jar:

    from pyspark.sql import SparkSession

    spark = SparkSession \
            .builder \
            .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
            .getOrCreate()
    

    Other configs can be set from pyspark:

    spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
    # These two are required if you are using a service account
    spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
    spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', '/path/to/keyfile')
    # The following are required if you are using OAuth
    spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
    spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
    

    Alternatively, you can set these configs in core-site.xml or spark-defaults.conf.
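    For example, the equivalent spark-defaults.conf entries might look like this, using Spark's `spark.hadoop.*` prefix to forward properties into the Hadoop configuration (jar and keyfile paths are placeholders):

    ```
    spark.jars                                                   /path/to/gcs-connector-hadoop2-latest.jar
    spark.hadoop.fs.gs.impl                                      com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    spark.hadoop.fs.gs.auth.service.account.enable               true
    spark.hadoop.google.cloud.auth.service.account.json.keyfile  /path/to/keyfile
    ```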
