aws-glue

How can an AWS Glue job upload several tables to Redshift?

家住魔仙堡 submitted on 2019-12-10 18:17:06
Question: Is it possible to load multiple tables into Redshift using an AWS Glue job? These are the steps I followed. I crawled JSON from S3 and the data was translated into a Data Catalog table. I created a job that uploads the Data Catalog table to Redshift, but it only lets me upload one table per job. In the job properties (when adding a job), the "This job runs" option I chose is: A proposed script generated by AWS Glue. I am not familiar with Python and I am new to AWS Glue, but I have several
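A minimal sketch of one way to load several catalog tables in a single Glue job: the generated script handles one table, but the same read/write calls can simply be looped. The database name, table list, connection name, and S3 temp path below are placeholders, not values from the question.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Hypothetical catalog database, table names, and Redshift connection
tables = ['orders', 'customers', 'line_items']
for table_name in tables:
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database='my_catalog_db',                 # assumed database name
        table_name=table_name,
        transformation_ctx='read_' + table_name)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection='my-redshift-connection',           # assumed Glue connection
        connection_options={'dbtable': table_name, 'database': 'dev'},
        redshift_tmp_dir='s3://my-temp-bucket/redshift-tmp/',  # assumed temp dir
        transformation_ctx='write_' + table_name)

job.commit()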

AWS Glue issue with double quotes and commas

蹲街弑〆低调 submitted on 2019-12-10 17:37:57
Question: I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', 'separatorChar'=',') but it still won't recognize the double quotes in the data, and the comma inside the quoted field is messing up the data. When I run the Athena query, the result looks like this: reference address V7T452F4H9 "12410 W 62TH ST How do I
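As a quick way to confirm the file itself parses cleanly once quote handling is applied (separate from fixing the Athena table definition), here is a hedged PySpark sketch; the S3 path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quote-check").getOrCreate()

# quote and escape set so "12410 W 62TH ST, AA D" stays a single field
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')
      .csv("s3://my-bucket/path/to/file.csv"))   # assumed path
df.show(truncate=False)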

Unable to run scripts properly in AWS Glue PySpark Dev Endpoint

痴心易碎 submitted on 2019-12-10 17:15:31
Question: I've configured an AWS Glue dev endpoint and can connect to it successfully in a PySpark REPL shell, like this: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html Unlike the example given in the AWS documentation, I receive WARNings when I begin the session, and later on various operations on AWS Glue DynamicFrame structures fail. Here's the full log on starting the session - note the errors about spark.yarn.jars and PyGlue.zip: Python 2.7.12 (default, Sep 1 2016, 22:14
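A small sanity check that can be pasted into the REPL to see whether the Glue libraries (PyGlue.zip) were actually picked up; the catalog database and table names below are assumptions, not values from the question.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# If PyGlue.zip was not loaded, the import above or the call below
# is typically where the failure first shows up
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",     # assumed
    table_name="my_table")        # assumed
print(dyf.count())
dyf.printSchema()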

Add a partition to a Glue table via the API on AWS?

我是研究僧i submitted on 2019-12-10 15:55:29
Question: I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't know that it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its partitions table. I looked through the AWS documentation
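One way this is commonly done is with the boto3 Glue client's create_partition call. A hedged sketch, assuming a single partition key named dt and placeholder database, table, and bucket names; it copies the table's storage descriptor and points its Location at the new prefix.

import boto3

glue = boto3.client("glue")

database = "my_database"          # assumed
table = "my_table"                # assumed
partition_value = "2019-12-10"    # assumed value for a partition key named dt

# Reuse the table's storage descriptor, with Location moved to the new prefix
table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
sd = table_def["StorageDescriptor"].copy()
sd["Location"] = "s3://my-bucket/data/dt={}/".format(partition_value)

glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        "Values": [partition_value],
        "StorageDescriptor": sd,
    },
)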

AWS Glue Job Bookmarking

不问归期 submitted on 2019-12-10 12:16:14
Question: I wanted to see if there are more details about the way job bookmarking is done in AWS Glue. The AWS docs don't provide much on this. I know that there is basic functionality there: enable, disable, pause, reset. And it seems that the bookmarking happens at the time of job.commit(). Can I access it? Can it be modified to reprocess some portion of the source? Answer 1: Some additional info: the basic tactic of the job bookmark design is to save the START time of the last completed job. So when a job
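For reference, a minimal sketch of the pieces a bookmarked job needs, assuming the job was created with --job-bookmark-option set to job-bookmark-enable and using placeholder database/table names: bookmark state is loaded by Job.init, advanced per source via transformation_ctx, and saved by job.commit().

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)   # bookmark state is loaded here

# transformation_ctx ties this source to the bookmark; names are assumed
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='my_catalog_db',
    table_name='events',
    transformation_ctx='read_events')

# ... transforms and writes ...

job.commit()                       # bookmark state is saved here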

How to configure Glue bookmarks to work with Scala code?

好久不见. submitted on 2019-12-09 23:31:51
Question: Consider this Scala code: import com.amazonaws.services.glue.GlueContext import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions} import org.apache.spark.SparkContext import scala.collection.JavaConverters.mapAsJavaMapConverter object MyGlueJob { def main(sysArgs: Array[String]) { val spark: SparkContext = SparkContext.getOrCreate() val glueContext: GlueContext = new GlueContext(spark) val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray) Job.init(args("JOB

AWS Glue ETL job from AWS Redshift to S3 fails

拈花ヽ惹草 submitted on 2019-12-08 16:00:46
Question: I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the Data Catalog; however, when I run the ETL job (generated by AWS) it fails after around 20 minutes saying "Resource unavailable". I cannot see AWS Glue logs or error logs created in CloudWatch. When I try to view them it says "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry." I would appreciate it if
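When the CloudWatch stream is missing, the job-run record itself often still carries an error string. A hedged diagnostic sketch using boto3; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Print state and error message for the most recent runs of an assumed job name
for run in glue.get_job_runs(JobName="my-redshift-to-s3-job")["JobRuns"][:5]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage"))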

PySpark: How to add columns whose data comes from a query (similar to a subquery for each row)

戏子无情 submitted on 2019-12-08 14:09:25
Question: I have a holidays table: start: Date, end: Date, type: Enum(HOLIDAY|LONG_WEEKENDS). Some example data: "start","end","type" "2019-01-01","2019-01-01","HOLIDAY" "2019-02-05","2019-02-06","HOLIDAY" "2019-03-16","2019-03-24","HOLIDAY" "2019-04-19","2019-04-19","HOLIDAY" "2019-10-04","2019-10-04","HOLIDAY" "2019-08-08","2019-08-13","LONG_WEEKENDS" "2019-10-25","2019-10-29","LONG_WEEKENDS" "2019-12-20","2020-01-02","LONG_WEEKENDS" And a flights table; for simplicity, it has id: varchar, out_date: Date
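The usual PySpark pattern for this kind of per-row lookup is a non-equi (range) join followed by an aggregation back to one row per flight. A sketch under the assumption that the flights table has id and out_date columns and that both tables are read from placeholder CSV paths.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("holiday-flags").getOrCreate()

# Assumed input paths
holidays = spark.read.option("header", "true").csv("s3://my-bucket/holidays.csv")
flights = spark.read.option("header", "true").csv("s3://my-bucket/flights.csv")

# Cast the string columns to dates so the range comparison is explicit
holidays = (holidays
            .withColumn("start", F.to_date("start"))
            .withColumn("end", F.to_date("end")))
flights = flights.withColumn("out_date", F.to_date("out_date"))

# Left join on a range condition, then aggregate back to one row per flight
joined = flights.join(
    holidays,
    (flights["out_date"] >= holidays["start"]) & (flights["out_date"] <= holidays["end"]),
    "left")

result = (joined
          .groupBy("id", "out_date")
          .agg(F.max(F.when(F.col("type") == "HOLIDAY", 1).otherwise(0)).alias("out_is_holiday"),
               F.max(F.when(F.col("type") == "LONG_WEEKENDS", 1).otherwise(0)).alias("out_is_long_weekend")))
result.show()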

AWS Glue: creating a Data Catalog table with boto3 in Python

荒凉一梦 submitted on 2019-12-08 06:13:33
Question: I have been trying to create a table within our Data Catalog using the Python API, following the documentation posted here and here for the API. I can understand how that goes. Nevertheless, I need to understand how to declare a field structure when I create the table, because when I take a look at the storage definition for the table here, there isn't any explanation of how I should define this type of column for my table. In addition, I don't see the classification property for the table where
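For illustration, a hedged boto3 sketch of create_table in which a nested field is declared by spelling out struct<...> in the column's Type string and classification is set in the table's Parameters; every name, path, and SerDe below is an assumption, not taken from the question.

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="my_database",                      # assumed
    TableInput={
        "Name": "my_json_table",                     # assumed
        "Parameters": {"classification": "json"},    # classification lives in table Parameters
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},
                # Nested field declared directly in the type string
                {"Name": "address", "Type": "struct<street:string,city:string,zip:string>"},
            ],
            "Location": "s3://my-bucket/my-json-data/",   # assumed
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
    },
)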

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

可紊 submitted on 2019-12-08 02:44:51
Question: I'm getting an error when running an Athena query against a Glue table created from an RDS database: HIVE_UNKNOWN_ERROR: Unable to create input format. The tables are created using a crawler. The tables show up correctly in the Glue interface; however, they do not show up in the Athena interface under the database. It says: "The selected database has no tables". I do not see this behaviour when using a database created from an S3 file. Maybe this is related to the error. Does anybody have an
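One quick check is to inspect where each catalog table's storage descriptor points: Athena can only query tables whose data lives in S3, so tables crawled from an RDS/JDBC source will not be usable from Athena. A hedged boto3 sketch with a placeholder database name.

import boto3

glue = boto3.client("glue")

# Tables whose Location is not an s3:// path (e.g. JDBC tables crawled from RDS)
# will not be queryable from Athena
for t in glue.get_tables(DatabaseName="my_database")["TableList"]:
    sd = t.get("StorageDescriptor", {})
    print(t["Name"], sd.get("Location"), t.get("Parameters", {}).get("classification"))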