Using pyspark to connect to PostgreSQL

逝去的感伤 · 2020-12-01 04:50

I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql         


        
10 Answers

渐次进展 · 2020-12-01 05:14

    The following worked for me with postgres on localhost:

    Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.
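
    For example, with curl (the version in the file name below is a placeholder; pick a current build from the download page):

    $ curl -O https://jdbc.postgresql.org/download/postgresql-42.2.5.jar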

    For the pyspark shell, set the SPARK_CLASSPATH environment variable:

    $ export SPARK_CLASSPATH=/path/to/downloaded/jar
    $ pyspark
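
    Once the shell is up, you can verify that the driver jar actually made it onto the classpath before going further. This uses the py4j gateway that pyspark exposes as sc._jvm (sc is created by the shell); if the jar is missing, the call raises an error instead of returning a Java class:

    >>> sc._jvm.java.lang.Class.forName("org.postgresql.Driver")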
    

    For submitting a script via spark-submit, use the --driver-class-path flag:

    $ spark-submit --driver-class-path /path/to/downloaded/jar script.py
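
    If the job runs executors on a real cluster (not just local mode), the jar must also be shipped to them; passing it via --jars in addition to --driver-class-path covers both sides (a sketch, assuming the same jar path is readable from the submitting machine):

    $ spark-submit --driver-class-path /path/to/downloaded/jar --jars /path/to/downloaded/jar script.py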
    

    In the Python script, load the table as a DataFrame as follows:

    from pyspark.sql import DataFrameReader
    
    url = 'postgresql://localhost:5432/dbname'
    properties = {'user': 'username', 'password': 'password'}
    df = DataFrameReader(sqlContext).jdbc(
        url='jdbc:%s' % url, table='tablename', properties=properties
    )
    

    or alternatively:

    df = sqlContext.read.format('jdbc').\
        options(url='jdbc:%s' % url, dbtable='tablename').\
        load()
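
    Note that this variant as written passes no credentials. If the server requires them, they can go into options() too; a sketch reusing the url value from above (user, password and driver are the option keys Spark's JDBC source understands):

    df = sqlContext.read.format('jdbc').options(
        url='jdbc:%s' % url,
        dbtable='tablename',
        user='username',
        password='password',
        driver='org.postgresql.Driver',
    ).load()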
    

    Note that when submitting the script via spark-submit, you need to define the sqlContext.
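
    In Spark 1.x that amounts to something like the following at the top of script.py (a minimal sketch; the app name is arbitrary, and in Spark 2.x you would create a SparkSession instead):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Under spark-submit you create the contexts yourself;
    # the pyspark shell already provides sc and sqlContext.
    conf = SparkConf().setAppName('postgres-read')
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)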
