I'm trying to connect to Amazon Redshift via Spark so that I can join data we have on S3 with data on our Redshift (RS) cluster. I found some very spartan documentation here for the capab
The simplest way to make a JDBC connection to Redshift using Python is as follows:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
jdbc_user = "xxx"
jdbc_password = "xxx"
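# Despite the variable name, this is the spark-redshift data source name
# passed to .format(), not a JDBC driver class.
jdbc_driver = "com.databricks.spark.redshift"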
# Build a Hive-enabled session that uses the AWS Glue Data Catalog as its metastore
spark = SparkSession.builder.master("yarn") \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .enableHiveSupport() \
    .getOrCreate()
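# Read data from a query
# Note: spark-redshift stages data through S3, so depending on your setup you
# may also need a "tempdir" option pointing at an S3 path you can write to.
df = spark.read \
    .format(jdbc_driver) \
    .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
    .option("query", "your query") \
    .load()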
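One caveat with concatenating the credentials into the URL: a password containing characters such as & or ? can break the query string. Below is a minimal sketch of an alternative that keeps the credentials out of the URL. It assumes your spark-redshift version accepts separate "user" and "password" options and a "tempdir" option (I believe it does, but check the README for the version you run). The helper name build_redshift_read_options and all the values passed to it are hypothetical, for illustration only.

```python
def build_redshift_read_options(jdbc_url, user, password, query, tempdir):
    """Return a dict of reader options with credentials kept out of the URL.

    Hypothetical helper: the "user", "password" and "tempdir" option names
    are assumed to be supported by the spark-redshift version in use.
    """
    # Refuse URLs that already embed credentials, so they are never passed twice.
    if "user=" in jdbc_url or "password=" in jdbc_url:
        raise ValueError("pass credentials as arguments, not inside the JDBC URL")
    return {
        "url": jdbc_url,
        "user": user,
        "password": password,  # no escaping needed, it is not part of the URL
        "query": query,
        "tempdir": tempdir,    # S3 staging path (assumed to be required)
    }


if __name__ == "__main__":
    # Placeholder values in the same style as the snippet above.
    opts = build_redshift_read_options(
        "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx",
        "xxx",
        "p&ss?word",
        "your query",
        "s3://xxx/tmp/",
    )
    # Print only the option names, never the secret values.
    print(sorted(opts))
```

The resulting dict could then be handed to the reader in one go, e.g. spark.read.format(jdbc_driver).options(**opts).load(), instead of chaining .option() calls.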