Spark 2.0.0 truncate from Redshift table using jdbc

Submitted by 断了今生、忘了曾经 on 2019-12-13 07:22:56

Question


Hello, I am using Spark SQL (2.0.0) with Redshift, and I want to truncate my tables. I am using the spark-redshift package, and I want to know how I can truncate my table. Can anyone please share an example of this?


Answer 1:


I was unable to accomplish this using Spark and the code in the spark-redshift repo that you have listed above.

I was, however, able to use AWS Lambda with psycopg2 to truncate a Redshift table. Then I used boto3 to kick off my Spark job via AWS Glue.

The important code below is cur.execute("truncate table yourschema.yourtable")

from __future__ import print_function
import sys
import psycopg2
import boto3

def lambda_handler(event, context):
    db_database = "your_redshift_db_name"
    db_user = "your_user_name"
    db_password = "your_password"
    db_port = "5439"
    db_host = "your_redshift.hostname.us-west-2.redshift.amazonaws.com"

    try:
        print("attempting to connect...")
        conn = psycopg2.connect(dbname=db_database, user=db_user, password=db_password, host=db_host, port=db_port)
        print("connected...")
        conn.autocommit = True
        cur = conn.cursor()
        count_sql = "select count(pivotid) from yourschema.yourtable"
        cur.execute(count_sql)
        results = cur.fetchone()

        print("countBefore: ", results[0])
        countOfPivots = results[0]
        if countOfPivots > 0:
            cur.execute("truncate table yourschema.yourtable")
            print("truncated yourschema.yourtable")
            cur.execute(count_sql)
            results = cur.fetchone()
            print("countAfter: ", results[0])

        cur.close()
        conn.close()

        # start_trigger returns a plain dict, so the trigger name is read
        # with a key lookup rather than attribute access
        glueClient = boto3.client("glue")
        startTriggerResponse = glueClient.start_trigger(Name="your-awsglue-ondemand-trigger")
        print("startedTrigger:", startTriggerResponse["Name"])

        return results
    except Exception as e:
        print(e)
        raise e
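
Since psycopg2 cannot parameterize identifiers, the Lambda above interpolates the schema and table names directly into the TRUNCATE statement. As a small safety net, those names can be validated first. The helper below is a hypothetical sketch, not part of the original answer:

```python
import re

# Plain Redshift identifiers: a letter or underscore, then letters,
# digits, underscores, or dollar signs (quoted identifiers not handled).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

def truncate_sql(schema, table):
    """Build a TRUNCATE statement, rejecting anything that is not a
    simple identifier so a bad config value cannot inject extra SQL."""
    for name in (schema, table):
        if not _IDENTIFIER.match(name):
            raise ValueError("invalid identifier: %r" % name)
    return "truncate table {}.{}".format(schema, table)

print(truncate_sql("yourschema", "yourtable"))
# truncate table yourschema.yourtable
```

The result would then be passed to cur.execute() in place of the hard-coded string.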



Answer 2:


You need to specify the save mode on the writer before calling save. For example:

my_dataframe.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://my_cluster.qwertyuiop.eu-west-1.redshift.amazonaws.com:5439/my_database?user=my_user&password=my_password")
   .option("dbtable", "my_table")
   .option("tempdir", "s3://my-bucket")
   .option("diststyle", "KEY")
   .option("distkey", "dist_key")
   .option("sortkeyspec", "COMPOUND SORTKEY(key_1, key_2)")
   .option("extracopyoptions", "TRUNCATECOLUMNS COMPUPDATE OFF STATUPDATE OFF")
   .mode("overwrite") // "append" / "error"
   .save()


Source: https://stackoverflow.com/questions/40972861/spark-2-0-0-truncate-from-redshift-table-using-jdbc
