Reading from the Google Storage gs:// filesystem from a local Spark instance

情深已故 2021-01-07 05:58

The problem is quite simple: you have a local Spark instance (either a cluster or just running in local mode) and you want to read from gs://.

3 Answers
  •  余生分开走
    2021-01-07 06:31

    Considering that it has been a while since the last answer, I thought I would share my recent solution. Note that the following instructions are for Spark 2.4.4.

    1. Download the "gcs-connector" jar that matches your Spark/Hadoop version from here. Look under the "Other Spark/Hadoop clusters" section.
    2. Move the "gcs-connector" jar to $SPARK_HOME/jars. See more about $SPARK_HOME below.
    3. Make sure that all the environment variables are properly set up for your Spark application to run. That is:
      a. SPARK_HOME pointing to the location of your Spark installation.
      b. GOOGLE_APPLICATION_CREDENTIALS pointing to the location of your service-account JSON key. If you have just downloaded it, it will be in ~/Downloads.
      c. JAVA_HOME pointing to your Java 8* "Home" folder.

      On Linux/macOS you can use export VAR=DIR, where VAR is the variable name and DIR is the location; to set them permanently, add the exports to your ~/.bash_profile or ~/.zshrc. On Windows, write set VAR=DIR in cmd for the current session, or setx VAR DIR to store the variable permanently.
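
    Once the jar and credentials are in place, reading from gs:// is just a normal DataFrame read. Below is a minimal PySpark sketch of what that can look like; the bucket name, file path, and key-file location are placeholders, and the explicit connector properties are optional (the connector can also pick up GOOGLE_APPLICATION_CREDENTIALS on its own) and may vary slightly between gcs-connector versions.

    ```python
    # Minimal sketch, assuming the gcs-connector jar is already in $SPARK_HOME/jars
    # and you are running Spark 2.4.x locally. Paths and bucket names are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gcs-local-read")
        .master("local[*]")
        # Register the GCS filesystem implementation shipped with the connector.
        .config("spark.hadoop.fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        # Authenticate with a service-account key file (alternative to relying
        # solely on the GOOGLE_APPLICATION_CREDENTIALS environment variable).
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                "/path/to/your-key.json")
        .getOrCreate()
    )

    # Read a CSV straight from a (hypothetical) bucket; any DataFrame reader
    # (parquet, json, text, ...) works the same way with a gs:// path.
    df = spark.read.csv("gs://your-bucket/path/to/file.csv", header=True)
    df.show()
    ```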

    That has worked for me, and I hope it helps others too.

    * Spark 2.4 runs on Java 8, so some of its features might not be compatible with the latest Java Development Kit.
