load external libraries inside pyspark code

Question


I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:

import os
import sys

# Point to the local Spark installation and make its Python libraries importable
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext, SparkConf, SQLContext

# Reuse an existing SparkContext if one is already defined, otherwise create one
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc = SparkContext()

sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load("/my/path/to/my/file.csv")

When I run it, I get the following error:

java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.

My question: how can I load the databricks spark-csv library from INSIDE my Python code? I don't want to load it from outside (using --packages), for instance.

I tried to add the following line, but it did not work:

os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'

Answer 1:


If you create the SparkContext from scratch, you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:

os.environ["PYSPARK_SUBMIT_ARGS"] = (
  "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

sc = SparkContext()

If for some reason you expect that the SparkContext may have already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and a URLClassLoader, but it doesn't look like a good idea and it won't work in cluster mode.
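For completeness, here is a minimal sketch of how this could be combined with the code from the question, assuming SPARK_HOME and sys.path are already set up as shown there; the package coordinates, paths, and file name are simply the ones from the question, so treat them as placeholders:

import os
from pyspark import SparkContext
from pyspark.sql import SQLContext

# This must be set BEFORE the SparkContext is created, because the JVM
# (and the dependency resolution triggered by --packages) is started at
# that point.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

sc = SparkContext()
sq = SQLContext(sc)

df = (sq.read
        .format('com.databricks.spark.csv')
        .options(header='true', inferschema='true')
        .load("/my/path/to/my/file.csv"))

Note that the Scala suffix in the package coordinates (_2.10 vs _2.11) has to match the Scala version your Spark distribution was built against, otherwise you can run into binary-compatibility errors at runtime.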



Source: https://stackoverflow.com/questions/35346567/load-external-libraries-inside-pyspark-code
