Working with a JDBC jar in PySpark


I found a solution which works (I don't know if it is the best one, so feel free to keep commenting). Apparently, if I add the option driver="org.postgresql.Driver", this works properly. My full read call (inside pyspark) is:

df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd",
    dbtable="table_name",
    driver="org.postgresql.Driver"
).load()
df.count()
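
For context, the driver option only tells Spark which JDBC Driver class to instantiate; the PostgreSQL jar itself still has to be on the classpath. As a minimal sketch (the jar path below is just a placeholder for wherever your driver jar actually lives), one way to launch pyspark with the driver jar available is roughly:

    # --jars ships the driver to the executors, --driver-class-path puts it on the driver's classpath
    $ pyspark --jars /path/to/postgresql-9.4.1208.jar \
              --driver-class-path /path/to/postgresql-9.4.1208.jar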

Another thing: if you are already using a fat jar of your own (I am in my full application), then all you need to do is add the JDBC driver to your POM file, like so:

    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4.1208</version>
    </dependency>

and then you don't have to add the driver as a separate jar; just use the jar with dependencies.
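
For completeness, submitting such a fat jar is then just a matter of pointing spark-submit at it; the class name and jar name below are made up for illustration:

    # the PostgreSQL driver is bundled inside the assembly, so no extra --jars flag is needed
    $ spark-submit --class com.example.MyApp \
          target/myapp-1.0-jar-with-dependencies.jar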

What version of the documentation are you looking at? It seems like compute-classpath.sh was removed a while back - it still ships in Spark 1.3.1 but is gone as of 1.4.0:

$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
 6592  2015-04-11 00:04   spark-1.3.1/bin/compute-classpath.sh

$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa

produces nothing.

I think you should be using load-spark-env.sh to set your classpath:

$ /opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh

and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).
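
For illustration, the relevant line in spark-env.sh could look roughly like this (the jar path is just a placeholder):

    # in $SPARK_HOME/conf/spark-env.sh (copied from spark-env.sh.template)
    export SPARK_CLASSPATH=/path/to/postgresql-9.4.1208.jar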

I think this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix 3 weeks ago and there has been no movement on it. Maybe if others also report that they have been affected by it, that will help?
