How to set up a local development environment for Scala Spark ETL to run in AWS Glue?


Question


I'd like to be able to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process. But I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS.

The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are a Java/Scala port of this library: aws-glue-libs

The template Scala code from AWS:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    // @type: DataSource
    // @args: [database = "raw-tickers-oregon", table_name = "spark_delivery_2_1", transformation_ctx = "datasource0"]
    // @return: datasource0
    // @inputs: []
    val datasource0 = glueContext.getCatalogSource(database = "raw-tickers-oregon", tableName = "spark_delivery_2_1", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
    // @type: ApplyMapping
    // @args: [mapping = [("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")], transformation_ctx = "applymapping1"]
    // @return: applymapping1
    // @inputs: [frame = datasource0]
    val applymapping1 = datasource0.applyMapping(mappings = Seq(("exchangeid", "int", "exchangeid", "int"), ("data", "struct", "data", "struct")), caseSensitive = false, transformationContext = "applymapping1")
    // @type: DataSink
    // @args: [connection_type = "s3", connection_options = {"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}, format = "json", transformation_ctx = "datasink2"]
    // @return: datasink2
    // @inputs: [frame = applymapping1]
    val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://spark-ticker-oregon/target", "compression": "gzip"}"""), transformationContext = "datasink2", format = "json").writeDynamicFrame(applymapping1)
    Job.commit()
  }
}

And the build.sbt I have started putting together for a local build:

name := "aws-glue-scala"

version := "0.1"

scalaVersion := "2.11.12"

updateOptions := updateOptions.value.withCachedResolution(true)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"

The documentation for the AWS Glue Scala API seems to outline functionality similar to what is available in the AWS Glue Python library. So perhaps all that is required is to download and build the PySpark AWS Glue library and add it to the classpath? That may be possible, since the Glue Python library uses Py4J.
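
For context, this is roughly what "adding it to the classpath" would look like in sbt, assuming a suitable assembly jar could be obtained (the lib/ path below is just a placeholder; finding such a jar is the crux of this question):

unmanagedJars in Compile += Attributed.blank(file("lib/glue-assembly.jar"))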


Answer 1:


This is now supported, via a recent release from AWS:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html
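
With that release, the Glue ETL library is published to an AWS-hosted Maven repository, so the build.sbt from the question can depend on it directly. A minimal sketch; check the linked docs for the exact repository URL and for the artifact version matching your Glue version (1.0.0 is used here only as an example):

resolvers += "aws-glue-etl-artifacts" at "https://aws-glue-etl-artifacts.s3.amazonaws.com/release/"

libraryDependencies += "com.amazonaws" % "AWSGlueETL" % "1.0.0"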




Answer 2:


@Frederic gave a very helpful hint to get the dependency from s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar.

Unfortunately, that version of glue-assembly.jar is already outdated and brings in Spark version 2.1. That's fine if you're using backward-compatible features, but if you rely on the latest Spark version (and possibly the latest Glue features) you can get the appropriate jar from a Glue dev endpoint under /usr/share/aws/glue/etl/jars/glue-assembly.jar.

Provided you have a dev endpoint named my-dev-endpoint, you can copy the current jar from it:

export DEV_ENDPOINT_HOST=`aws glue get-dev-endpoint --endpoint-name my-dev-endpoint --query 'DevEndpoint.PublicAddress' --output text`

scp -i dev-endpoint-private-key \
glue@$DEV_ENDPOINT_HOST:/usr/share/aws/glue/etl/jars/glue-assembly.jar .
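
To sanity-check that the copied assembly actually contains the classes imported by the GlueApp template, you can list its contents (this assumes a JDK jar tool is on your PATH):

jar tf glue-assembly.jar | grep GlueContext

From there, one way to put it on the local build's classpath is to drop it into the project's lib/ directory, which sbt treats as unmanaged dependencies automatically.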



Answer 3:


Unfortunately, there are no libraries available for the Scala Glue API. I have already contacted Amazon support and they are aware of this problem. However, they didn't provide any ETA for delivering the API jar.




Answer 4:


As a workaround, you can download the jar from S3. The S3 URI is s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar

See https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html
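
For example, with the AWS CLI (assuming your credentials can read that assets bucket):

aws s3 cp s3://aws-glue-jes-prod-us-east-1-assets/etl/jars/glue-assembly.jar .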



Source: https://stackoverflow.com/questions/49254077/how-to-set-up-a-local-development-environment-for-scala-spark-etl-to-run-in-aws
