I am trying to connect to a PostgreSQL database with pyspark, using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url="jdbc:postgresql://<host>/<database>",
    dbtable="<table>",
    source="jdbc"
)

but it fails with an exception saying the JDBC driver cannot be found. How can I make the PostgreSQL driver available to pyspark?
This exception means the JDBC driver is not on the driver's classpath.
You can pass the JDBC jar to spark-submit with the --jars parameter, and also add it to the driver classpath via spark.driver.extraClassPath.
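As a minimal sketch of the same idea done programmatically (the jar path is a placeholder, and both settings must be applied before any SparkSession/JVM is started):

from pyspark.sql import SparkSession

# Placeholder path to the PostgreSQL JDBC jar; adjust to your environment.
jdbc_jar = "/path/to/postgresql-42.2.12.jar"

spark = (SparkSession.builder
         .config("spark.jars", jdbc_jar)                   # ship the jar to the executors
         .config("spark.driver.extraClassPath", jdbc_jar)  # put it on the driver classpath
         .getOrCreate())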
One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.
This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:
/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
--packages org.postgresql:postgresql:9.4.1211\
--driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
--master local[4] main.py
And in main.py:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.format('jdbc').options(
    url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
    database='my_db',
    dbtable='my_table'
).load()
dataframe.show()
To use pyspark with a Jupyter notebook: first launch pyspark with
pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar
Then, in the Jupyter notebook:
import os
from pyspark.sql import SparkSession

# expand "~" so Spark receives an absolute path to the driver jar
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it into /opt/spark-2.2.0-bin-hadoop2.7/jars.
I also set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),
and it works fine in both the pyspark console and Jupyter.
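With the jar already in Spark's jars directory on every node, no extra flags are needed when reading; a minimal sketch (URL, table name, and credentials are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# the PostgreSQL driver is picked up automatically from Spark's jars/ directory
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/dbname",
    table="tablename",
    properties={"user": "usr", "password": "pswd"},
)
df.show()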
You normally need either to have the PostgreSQL JDBC driver installed on the cluster nodes, or to ship the driver jar yourself when launching pyspark (--jars / --driver-class-path), or to pull it in by its Maven coordinates with --packages.
If you describe how you are launching pyspark, we can give you more details.
Some clues/ideas:
spark-cannot-find-the-postgres-jdbc-driver
Not able to connect to postgres using jdbc in pyspark shell
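As a concrete example of the Maven-coordinates route mentioned above, the config key spark.jars.packages is the programmatic counterpart of the --packages flag; a sketch (the driver version is only an example, and it must be set before the session is created):

from pyspark.sql import SparkSession

# Spark resolves the driver from Maven Central when the session starts
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.postgresql:postgresql:42.2.12")
         .getOrCreate())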
The following worked for me with postgres on localhost:
Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
For the pyspark shell you use the SPARK_CLASSPATH environment variable:
$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark
For submitting a script via spark-submit use the --driver-class-path flag:
$ spark-submit --driver-class-path /path/to/downloaded/jar script.py
In the Python script, load the table as a DataFrame as follows:
from pyspark.sql import DataFrameReader
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)
or alternatively:
df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename',
            user='username', password='password').\
    load()
Note that when submitting the script via spark-submit, you need to define sqlContext yourself.
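A minimal sketch of that setup at the top of the script (Spark 1.x-style SQLContext, matching the examples above):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# spark-submit does not create sqlContext for a plain script, so build it here
sc = SparkContext()
sqlContext = SQLContext(sc)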