I am trying to read a json file from a google bucket into a pyspark dataframe on a local spark machine. Here\'s the code:
import pandas as pd
import numpy a
Some config params are required to recognize "gs" as a distributed filesystem.
Use this setting for google cloud storage connector, gcs-connector-hadoop2-latest.jar
spark = SparkSession \
.builder \
.config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
.getOrCreate()
Other configs that can be set from pyspark
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# This is required if you are using service account and set true,
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'ture')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "/path/to/keyfile")
# Following are required if you are using oAuth
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.id', 'YOUR_OAUTH_CLIENT_ID')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.client.secret', 'OAUTH_SECRET')
Alternatively you can set up these configs in core-site.xml or spark-defaults.conf.