how can aws glue job upload several tables in redshift

家住魔仙堡 提交于 2019-12-10 18:17:06

问题


Is it possible to load multiple tables in Redshift using AWS Glue job?

These are the steps I followed.

  1. Crawled json from S3 and the data has been translated into data catalog table.
  2. I created a job that will upload the data catalog table in redshift but it only limits me to upload 1 table for every job. In the job properties (in adding a job), This job runs option I chose is: A proposed script generated by AWS Glue.

I am not familiar with python and I am new to AWS Glue. but I have several tables needed to be uploaded.

here is a sample script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "sampledb", table_name = "abs", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

aws glue database: sampledb
table name in aws glue: abs
redshift database: dbmla

Please site an example on how to upload them. thanks!


回答1:


According to AWS Glue FAQ, you can modify the generated code, and run the job.

Q: How can I customize the ETL code generated by AWS Glue?

AWS Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using AWS Glue’s custom library or write arbitrary code in Scala or Python by using inline editing via the AWS Glue Console script editor, downloading the auto-generated code, and editing it in your own IDE. You can also start with one of the many samples hosted in our Github repository and customize that code.

So please try add code snippets for additional tables to the same script as follows,

datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs2", transformation_ctx = "datasource1")
applymapping2 = ApplyMapping.apply(.. transformation_ctx = "applymapping2")
resolvechoice2 = ResolveChoice.apply(frame = applymapping2, choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs2", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")

datasource2 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs2", transformation_ctx = "datasource1")
applymapping2 = ApplyMapping.apply(.. transformation_ctx = "applymapping2")
resolvechoice2 = ResolveChoice.apply(frame = applymapping2, choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs2", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")

datasource3 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs2", transformation_ctx = "datasource1")
applymapping2 = ApplyMapping.apply(.. transformation_ctx = "applymapping2")
resolvechoice2 = ResolveChoice.apply(frame = applymapping2, choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs2", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")

job.commit()

Change the variable names accordingly so as to be unique. Thanks



来源:https://stackoverflow.com/questions/50459840/how-can-aws-glue-job-upload-several-tables-in-redshift

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!