aws-glue

How can an AWS Glue job upload several tables to Redshift?

家住魔仙堡 submitted on 2019-12-10 18:17:06
Question: Is it possible to load multiple tables into Redshift using an AWS Glue job? These are the steps I followed. I crawled JSON from S3 and the data was translated into a Data Catalog table. I created a job that uploads the Data Catalog table to Redshift, but it only lets me upload one table per job. In the job properties (when adding a job), the "This job runs" option I chose is: A proposed script generated by AWS Glue. I am not familiar with Python and I am new to AWS Glue, but I have several
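A minimal sketch of one way to load several catalog tables in a single Glue job: the generated script handles one table, but the same read/write calls can simply be looped. The database name, table list, connection name, and S3 temp path below are placeholders, not values from the question.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Hypothetical catalog database, table names, and Redshift connection
tables = ['orders', 'customers', 'line_items']
for table_name in tables:
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database='my_catalog_db',                 # assumed database name
        table_name=table_name,
        transformation_ctx='read_' + table_name)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection='my-redshift-connection',           # assumed Glue connection
        connection_options={'dbtable': table_name, 'database': 'dev'},
        redshift_tmp_dir='s3://my-temp-bucket/redshift-tmp/',  # assumed temp dir
        transformation_ctx='write_' + table_name)

job.commit()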

AWS Glue issue with double quotes and commas

蹲街弑〆低调 submitted on 2019-12-10 17:37:57
Question: I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', 'separatorChar'=',') but it still won't recognize the double quotes in the data, and the comma inside the quoted field is messing up the data. When I run the Athena query, the result looks like this: reference address V7T452F4H9 "12410 W 62TH ST How do I
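As a quick way to confirm the file itself parses cleanly once quote handling is applied (separate from fixing the Athena table definition), here is a hedged PySpark sketch; the S3 path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-quote-check").getOrCreate()

# quote and escape set so "12410 W 62TH ST, AA D" stays a single field
df = (spark.read
      .option("header", "true")
      .option("quote", '"')
      .option("escape", '"')
      .csv("s3://my-bucket/path/to/file.csv"))   # assumed path
df.show(truncate=False)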

Unable to run scripts properly in AWS Glue PySpark Dev Endpoint

痴心易碎 submitted on 2019-12-10 17:15:31
Question: I've configured an AWS Glue dev endpoint and can connect to it successfully in a PySpark REPL shell, like this: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html Unlike the example given in the AWS documentation, I receive WARNings when I begin the session, and later on various operations on AWS Glue DynamicFrame structures fail. Here's the full log on starting the session - note the errors about spark.yarn.jars and PyGlue.zip: Python 2.7.12 (default, Sep 1 2016, 22:14
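A small sanity check that can be pasted into the REPL to see whether the Glue libraries (PyGlue.zip) were actually picked up; the catalog database and table names below are assumptions, not values from the question.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# If PyGlue.zip was not loaded, the import above or the call below
# is typically where the failure first shows up
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",     # assumed
    table_name="my_table")        # assumed
print(dyf.count())
dyf.printSchema()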

Add a partition to a Glue table via the API on AWS?

我是研究僧i submitted on 2019-12-10 15:55:29
Question: I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't know that it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its partitions table. I looked through the AWS documentation
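One way this is commonly done is with the boto3 Glue client's create_partition call. A hedged sketch, assuming a single partition key named dt and placeholder database, table, and bucket names; it copies the table's storage descriptor and points its Location at the new prefix.

import boto3

glue = boto3.client("glue")

database = "my_database"          # assumed
table = "my_table"                # assumed
partition_value = "2019-12-10"    # assumed value for a partition key named dt

# Reuse the table's storage descriptor, with Location moved to the new prefix
table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
sd = table_def["StorageDescriptor"].copy()
sd["Location"] = "s3://my-bucket/data/dt={}/".format(partition_value)

glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        "Values": [partition_value],
        "StorageDescriptor": sd,
    },
)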

AWS Glue Job Bookmarking

不问归期 submitted on 2019-12-10 12:16:14
Question: I wanted to see if there are more details about the way job bookmarking is done in AWS Glue. The AWS docs don't provide much on this. I know that there is basic functionality there: enable, disable, pause, reset. And it seems that the bookmarking happens at the time of job.commit(). Can I access it? Can it be modified to reprocess some portion of the source? Answer 1: Some additional info: the basic tactic of the job bookmark design is to save the START time of the last completed job. So when a job
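For reference, a minimal sketch of the pieces a bookmarked job needs, assuming the job was created with --job-bookmark-option set to job-bookmark-enable and using placeholder database/table names: bookmark state is loaded by Job.init, advanced per source via transformation_ctx, and saved by job.commit().

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)   # bookmark state is loaded here

# transformation_ctx ties this source to the bookmark; names are assumed
dyf = glueContext.create_dynamic_frame.from_catalog(
    database='my_catalog_db',
    table_name='events',
    transformation_ctx='read_events')

# ... transforms and writes ...

job.commit()                       # bookmark state is saved here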

How to configure Glue bookmarks to work with Scala code?

好久不见. submitted on 2019-12-09 23:31:51
Question: Consider this Scala code: import com.amazonaws.services.glue.GlueContext import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions} import org.apache.spark.SparkContext import scala.collection.JavaConverters.mapAsJavaMapConverter object MyGlueJob { def main(sysArgs: Array[String]) { val spark: SparkContext = SparkContext.getOrCreate() val glueContext: GlueContext = new GlueContext(spark) val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray) Job.init(args("JOB

AWS Glue ETL job from AWS Redshift to S3 fails

拈花ヽ惹草 submitted on 2019-12-08 16:00:46
Question: I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the Data Catalog; however, when I run the ETL job (generated by AWS) it fails after around 20 minutes saying "Resource unavailable". I cannot see AWS Glue logs or error logs created in CloudWatch. When I try to view them it says "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry." I would appreciate it if
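When the CloudWatch stream is missing, the job-run record itself often still carries an error string. A hedged diagnostic sketch using boto3; the job name is a placeholder.

import boto3

glue = boto3.client("glue")

# Print state and error message for the most recent runs of an assumed job name
for run in glue.get_job_runs(JobName="my-redshift-to-s3-job")["JobRuns"][:5]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage"))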

PySpark: How to add columns whose data comes from a query (similar to a subquery for each row)

戏子无情 submitted on 2019-12-08 14:09:25
Question: I have a holidays table: start: Date, end: Date, type: Enum(HOLIDAY|LONG_WEEKENDS). Some example data: "start","end","type" "2019-01-01","2019-01-01","HOLIDAY" "2019-02-05","2019-02-06","HOLIDAY" "2019-03-16","2019-03-24","HOLIDAY" "2019-04-19","2019-04-19","HOLIDAY" "2019-10-04","2019-10-04","HOLIDAY" "2019-08-08","2019-08-13","LONG_WEEKENDS" "2019-10-25","2019-10-29","LONG_WEEKENDS" "2019-12-20","2020-01-02","LONG_WEEKENDS" And a flights table; for simplicity, it has id: varchar, out_date: Date
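The usual PySpark pattern for this kind of per-row lookup is a non-equi (range) join followed by an aggregation back to one row per flight. A sketch under the assumption that the flights table has id and out_date columns and that both tables are read from placeholder CSV paths.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("holiday-flags").getOrCreate()

# Assumed input paths
holidays = spark.read.option("header", "true").csv("s3://my-bucket/holidays.csv")
flights = spark.read.option("header", "true").csv("s3://my-bucket/flights.csv")

# Cast the string columns to dates so the range comparison is explicit
holidays = (holidays
            .withColumn("start", F.to_date("start"))
            .withColumn("end", F.to_date("end")))
flights = flights.withColumn("out_date", F.to_date("out_date"))

# Left join on a range condition, then aggregate back to one row per flight
joined = flights.join(
    holidays,
    (flights["out_date"] >= holidays["start"]) & (flights["out_date"] <= holidays["end"]),
    "left")

result = (joined
          .groupBy("id", "out_date")
          .agg(F.max(F.when(F.col("type") == "HOLIDAY", 1).otherwise(0)).alias("out_is_holiday"),
               F.max(F.when(F.col("type") == "LONG_WEEKENDS", 1).otherwise(0)).alias("out_is_long_weekend")))
result.show()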

AWS Glue: creating a Data Catalog table with boto3 in Python

荒凉一梦 submitted on 2019-12-08 06:13:33
Question: I have been trying to create a table within our Data Catalog using the Python API, following the documentation posted here and here for the API. I can understand how that goes. Nevertheless, I need to understand how to declare a field structure when I create the table, because when I take a look at the storage definition for the table here, there isn't any explanation of how I should define this type of column for my table. In addition, I don't see the classification property for the table where
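For illustration, a hedged boto3 sketch of create_table in which a nested field is declared by spelling out struct<...> in the column's Type string and classification is set in the table's Parameters; every name, path, and SerDe below is an assumption, not taken from the question.

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="my_database",                      # assumed
    TableInput={
        "Name": "my_json_table",                     # assumed
        "Parameters": {"classification": "json"},    # classification lives in table Parameters
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},
                # Nested field declared directly in the type string
                {"Name": "address", "Type": "struct<street:string,city:string,zip:string>"},
            ],
            "Location": "s3://my-bucket/my-json-data/",   # assumed
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
    },
)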

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

可紊 submitted on 2019-12-08 02:44:51
Question: I'm getting an error when running an Athena query against a Glue table created from an RDS database: HIVE_UNKNOWN_ERROR: Unable to create input format. The tables are created using a crawler. The tables show up correctly in the Glue interface; however, they do not show up in the Athena interface under the database. It says: "The selected database has no tables". I do not see this behaviour when using a database created from an S3 file. Maybe this is related to the error. Does anybody have an
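One quick check is to inspect where each catalog table's storage descriptor points: Athena can only query tables whose data lives in S3, so tables crawled from an RDS/JDBC source will not be usable from Athena. A hedged boto3 sketch with a placeholder database name.

import boto3

glue = boto3.client("glue")

# Tables whose Location is not an s3:// path (e.g. JDBC tables crawled from RDS)
# will not be queryable from Athena
for t in glue.get_tables(DatabaseName="my_database")["TableList"]:
    sd = t.get("StorageDescriptor", {})
    print(t["Name"], sd.get("Location"), t.get("Parameters", {}).get("classification"))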