Question:
I am trying to connect to a database with pyspark and I am using the following code:
    sqlctx = SQLContext(sc)
    df = sqlctx.load(
        url="jdbc:postgresql://[hostname]/[database]",
        dbtable="(SELECT * FROM talent LIMIT 1000) as blah",
        password="MichaelJordan",
        user="ScottyPippen",
        source="jdbc",
        driver="org.postgresql.Driver"
    )
and I am getting the following error:

Any idea why this is happening?
Edit: I am trying to run the code locally on my computer.
Answer 1:
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:

    $ export SPARK_CLASSPATH=/path/to/downloaded/jar
    $ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:

    $ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the Python script, load the tables as a DataFrame as follows:

    from pyspark.sql import DataFrameReader

    url = 'postgresql://localhost:5432/dbname'
    properties = {'user': 'username', 'password': 'password'}
    df = DataFrameReader(sqlContext).jdbc(
        url='jdbc:%s' % url, table='tablename', properties=properties
    )
or alternatively:
    df = sqlContext.read.format('jdbc').\
        options(url='jdbc:%s' % url, dbtable='tablename').\
        load()
Note that when submitting the script via spark-submit, you need to define the sqlContext.
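For example (a minimal sketch using the Spark 1.x API; the app name is a placeholder, not from the original answer):

    # define sc and sqlContext yourself when running under spark-submit;
    # the app name below is illustrative
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName='postgres-read')
    sqlContext = SQLContext(sc)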
Answer 2:
You normally need either:
- to install the Postgres driver on your cluster,
- to provide the Postgres driver jar from your client with the --jars option,
- or to provide the Maven coordinates of the Postgres driver with the --packages option (both flags are sketched below).
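For example (a rough sketch; the jar path and driver version here are assumptions):

    # ship a local driver jar with the application
    $ pyspark --jars /path/to/postgresql-42.2.5.jar

    # or let Spark resolve the driver from Maven Central
    $ pyspark --packages org.postgresql:postgresql:42.2.5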
If you detail how you are launching pyspark, we may be able to give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
Answer 3:
This exception means the jdbc driver is not on the driver classpath. You can pass jdbc jars to spark-submit with the --jars parameter, and also add them to the driver classpath using spark.driver.extraClassPath.
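Roughly, combining both in one invocation (the jar path and version are assumptions):

    $ spark-submit \
        --jars /path/to/postgresql-42.2.5.jar \
        --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
        script.py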
Answer 4:
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
    /usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit \
        --packages org.postgresql:postgresql:9.4.1211 \
        --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar \
        --master local[4] main.py
And in main.py:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    dataframe = spark.read.format('jdbc').options(
        url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        database='my_db',
        dbtable='my_table'
    ).load()
    dataframe.show()
Answer 5:
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to the path /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in the pyspark console and Jupyter.
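A quick smoke test along these lines (run in the pyspark console, where spark is already defined; host, database, table, and credentials are placeholders, not from the answer):

    # connection details below are illustrative only
    df = spark.read.format('jdbc').options(
        url='jdbc:postgresql://localhost:5432/dbname',
        dbtable='tablename',
        user='username',
        password='password',
        driver='org.postgresql.Driver'
    ).load()
    df.show(5)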