How to connect to Amazon Redshift or other DBs in Apache Spark?

刺人心  2021-01-13 09:44

I'm trying to connect to Amazon Redshift via Spark, so I can join data we have on S3 with data on our RS cluster. I found some very spartan documentation here for the capab…

6 Answers
  •  温柔的废话
    2021-01-13 10:02

    The simplest way to make a JDBC connection to Redshift using Python is as follows:
    
    # -*- coding: utf-8 -*-
    from pyspark.sql import SparkSession
    
    jdbc_url = "jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/xxx"
    jdbc_user = "xxx"
    jdbc_password = "xxx"
    # Name of the spark-redshift data source (a Spark format, not a JDBC driver class)
    redshift_format = "com.databricks.spark.redshift"
    
    spark = SparkSession.builder.master("yarn") \
        .config("hive.metastore.client.factory.class",
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()
    
    # Read the result of a query. The connector unloads the data to S3
    # first, so a tempdir on S3 is required (replace with your own bucket).
    df = spark.read \
        .format(redshift_format) \
        .option("url", jdbc_url + "?user=" + jdbc_user + "&password=" + jdbc_password) \
        .option("tempdir", "s3a://your-bucket/tmp/") \
        .option("query", "your query") \
        .load()
    
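    One caveat with the snippet above: appending the user and password to the URL by plain string concatenation breaks as soon as the password contains characters such as `&` or `%`. A minimal sketch of a helper that URL-encodes the credentials first (the function name and sample values are hypothetical, not part of the answer above):
    
    ```python
    # -*- coding: utf-8 -*-
    # Hypothetical helper: URL-encode credentials before placing them in
    # the JDBC URL so passwords containing &, %, etc. survive intact.
    from urllib.parse import quote
    
    def redshift_jdbc_url(host, port, database, user, password):
        """Build a Redshift JDBC URL with safely encoded credentials."""
        return "jdbc:redshift://{}:{}/{}?user={}&password={}".format(
            host, port, database, quote(user, safe=""), quote(password, safe=""))
    
    url = redshift_jdbc_url("xxx.xxx.redshift.amazonaws.com", 5439, "mydb",
                            "analyst", "p&ss%word")
    print(url)
    # → jdbc:redshift://xxx.xxx.redshift.amazonaws.com:5439/mydb?user=analyst&password=p%26ss%25word
    ```
    
    The resulting string can then be passed directly as the `url` option of `spark.read`, instead of concatenating `jdbc_user` and `jdbc_password` by hand.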
