Overwrite MySQL tables with AWS Glue

随声附和 Submitted on 2019-12-01 02:52:43

Question


I have a lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an option to overwrite data using this key? (Similar to Spark's mode=overwrite). If not - might I be able to truncate the table in Glue before inserting all new data?

Thanks
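
For context, the Spark behaviour referred to above is the plain DataFrame JDBC writer's overwrite mode. A minimal sketch follows; the host, credentials, and table are placeholders, and df is assumed to hold the polled API data. Note that this replaces every row in the table rather than upserting by key, which is why the answers below fall back on TRUNCATE or REPLACE INTO:

# Plain Spark JDBC overwrite: with "truncate" set, the table is truncated and
# reloaded instead of being dropped and recreated.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:mysql://HOST:3306/DATABASE") \
    .option("dbtable", "my_table") \
    .option("user", "USERNAME") \
    .option("password", "PASSWORD") \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()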


Answer 1:


I ran into the same issue with Redshift, and the best solution we could come up with was to create a Java class that loads the MySQL driver and issues a TRUNCATE TABLE statement:

package com.my.glue.utils.mysql;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

@SuppressWarnings("unused")
public class MySQLTruncateClient {
    public void truncate(String tableName, String url) throws SQLException, ClassNotFoundException {
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection mysqlConnection = DriverManager.getConnection(url);
            Statement statement = mysqlConnection.createStatement()) {
            statement.execute(String.format("TRUNCATE TABLE %s", tableName));
        }
    }
}

Upload that JAR to S3 along with your MySQL JAR dependency and make your job dependent on both. In your PySpark script, you can load the truncate method with:

from py4j.java_gateway import java_import

java_import(glue_context._jvm, "com.my.glue.utils.mysql.MySQLTruncateClient")
truncate_client = glue_context._jvm.MySQLTruncateClient()
truncate_client.truncate('my_table', 'jdbc:mysql://...')
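
For the dependency setup described above, the two JAR paths are typically passed to the Glue job through the --extra-jars argument. A minimal sketch using boto3; the job name, IAM role, script location, and S3 paths are placeholders, not part of the original answer:

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="mysql-truncate-job",                                      # placeholder job name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",            # placeholder IAM role
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    DefaultArguments={
        # Both JARs were uploaded to S3 beforehand; Glue adds them to the job's classpath
        "--extra-jars": "s3://my-bucket/jars/glue-mysql-utils.jar,s3://my-bucket/jars/mysql-connector-java.jar"
    })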



Answer 2:


The workaround I've come up with, which is a little simpler than the alternative posted, is the following:

  • Create a staging table in MySQL, and load your new data into this table (a sketch of that load step follows the script below).
  • Run the command: REPLACE INTO myTable SELECT * FROM myStagingTable;
  • Truncate the staging table.

This can be done with:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb
db = MySQLdb.connect("HOST", "USERNAME", "PASSWORD", "DATABASE")  # positional args: host, user, password, database
cursor = db.cursor()
cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
db.commit()  # REPLACE INTO returns no rows; commit so the change persists

db.close()
job.commit()
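
The load into the staging table itself (the first bullet) isn't shown above. A minimal sketch using Glue's JDBC writer; the connection name, database, staging table, and DynamicFrame name are assumptions rather than part of the original answer:

# Step 1: write the freshly pulled data into the staging table before running REPLACE INTO
staging_sink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = new_data_dyf,                          # DynamicFrame built from the API data (assumed name)
    catalog_connection = "my-mysql-connection",    # Glue connection pointing at the MySQL instance (assumed)
    connection_options = {"dbtable": "myStagingTable", "database": "DATABASE"},
    transformation_ctx = "staging_sink")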



Answer 3:


I found a simpler way of working with JDBC connections in Glue. The way the Glue team recommends truncating a table is with the following sample code when you're writing data to your Redshift cluster:

datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = resolvechoice4,
    catalog_connection = "<connection-name>",
    connection_options = {
        "dbtable": "<target-table>",
        "database": "testdb",
        "preactions": "TRUNCATE TABLE <table-name>"
    },
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink5")

where

connection-name   your Glue connection name to your Redshift cluster
target-table      the table you're loading the data into
testdb            the name of the database
table-name        the name of the table to truncate (ideally the table you're loading into)


Source: https://stackoverflow.com/questions/47556678/overwrite-mysql-tables-with-aws-glue
