How to connect to Amazon Redshift or other DBs in Apache Spark?

刺人心  2021-01-13 09:44

I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capab…

6 Answers
  •  温柔的废话
    2021-01-13 10:02

    The simplest way to make a JDBC connection to Redshift using Python is as follows:
    
    # -*- coding: utf-8 -*-
    from pyspark.sql import SparkSession
    
    jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
    jdbc_user = "xxx"
    jdbc_password = "xxx"
    # Name of the spark-redshift data source (a Spark format, not a JDBC driver class)
    redshift_format = "com.databricks.spark.redshift"
    
    spark = SparkSession.builder.master("yarn") \
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()
    
    # Read the result of a query. The connector unloads the data to S3
    # first, so a tempdir on S3 is required (replace with your own bucket).
    df = spark.read \
        .format(redshift_format) \
        .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
        .option("tempdir", "s3a://your-bucket/tmp/") \
        .option("query", "your query") \
        .load()
    
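    One caveat with the snippet above: appending the user and password to the URL by plain string concatenation breaks as soon as the password contains characters such as `&` or `%`. A minimal sketch of a helper that URL-encodes the credentials first (the function name and sample values are hypothetical, not part of the answer above):
    
    ```python
    # -*- coding: utf-8 -*-
    # Hypothetical helper: URL-encode credentials before placing them in
    # the JDBC URL so passwords containing &, %, etc. survive intact.
    from urllib.parse import quote
    
    def redshift_jdbc_url(host, port, database, user, password):
        """Build a Redshift JDBC URL with safely encoded credentials."""
        return "jdbc:redshift://{}:{}/{}?user={}&password={}".format(
            host, port, database, quote(user, safe=""), quote(password, safe=""))
    
    url = redshift_jdbc_url("xxx.xxx.redshift.amazonaws.com", 5439, "mydb",
                            "analyst", "p&ss%word")
    print(url)
    # → jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/mydb?user=analyst&password=p%26ss%25word
    ```
    
    The resulting string can then be passed directly as the `url` option of `spark.read`, instead of concatenating `jdbc_user` and `jdbc_password` by hand.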
