Question:
I am trying to connect to a database with pyspark and I am using the following code:
    sqlctx = SQLContext(sc)
    df = sqlctx.load(
        url="jdbc:postgresql://[hostname]/[database]",
        dbtable="(SELECT * FROM talent LIMIT 1000) as blah",
        password="MichaelJordan",
        user="ScottyPippen",
        source="jdbc",
        driver="org.postgresql.Driver"
    )
and I am getting the following error:

Any idea why this is happening?
Edit: I am trying to run the code locally on my computer.
Answer 1:
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:

    $ export SPARK_CLASSPATH=/path/to/downloaded/jar
    $ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:

    $ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the Python script, load the tables as a DataFrame as follows:

    from pyspark.sql import DataFrameReader

    url = 'postgresql://localhost:5432/dbname'
    properties = {'user': 'username', 'password': 'password'}
    df = DataFrameReader(sqlContext).jdbc(
        url='jdbc:%s' % url, table='tablename', properties=properties
    )
or alternatively:
    df = sqlContext.read.format('jdbc').\
        options(url='jdbc:%s' % url, dbtable='tablename').\
        load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
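For example (a minimal sketch using the Spark 1.x API; the app name is a placeholder, not from the original answer):

    # define sc and sqlContext yourself when running under spark-submit;
    # the app name below is illustrative
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName='postgres-read')
    sqlContext = SQLContext(sc)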
Answer 2:
You normally need either:
- to install the Postgres driver on your cluster,
- to provide the Postgres driver jar from your client with the --jars option,
- or to provide the Maven coordinates of the Postgres driver with the --packages option (both flags are sketched below).
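For example (a rough sketch; the jar path and driver version here are assumptions):

    # ship a local driver jar with the application
    $ pyspark --jars /path/to/postgresql-42.2.5.jar

    # or let Spark resolve the driver from Maven Central
    $ pyspark --packages org.postgresql:postgresql:42.2.5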
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
Answer 3:
This exception means the jdbc driver is not on the driver classpath. You can pass jdbc jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
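Roughly, combining both in one invocation (the jar path and version are assumptions):

    $ spark-submit \
        --jars /path/to/postgresql-42.2.5.jar \
        --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
        script.py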
Answer 4:
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
    /usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
        --packages org.postgresql:postgresql:9.4.1211 \
        --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
        --master local[4] main.py
And in main.py:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dataframe = spark.read.format('jdbc').options(
        url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        database='my_db',
        dbtable='my_table'
    ).load()
    dataframe.show()
Answer 5:
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to the path /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in the pyspark console and Jupyter.
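A quick smoke test along these lines (run in the pyspark console, where spark is already defined; host, database, table, and credentials are placeholders, not from the answer):

    # connection details below are illustrative only
    df = spark.read.format('jdbc').options(
        url='jdbc:postgresql://localhost:5432/dbname',
        dbtable='tablename',
        user='username',
        password='password',
        driver='org.postgresql.Driver'
    ).load()
    df.show(5)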