aws-glue

Glue Job fails to write file

烈酒焚心 submitted on 2019-12-13 03:39:01
Question: I am backfilling some data via Glue jobs. The job reads a TSV from S3, transforms the data slightly, and writes it back to S3 as Parquet. Since I already have the data, I am launching multiple jobs at once to reduce the total processing time. When I launch multiple jobs at the same time, sometimes one of them fails to write the resultant Parquet files to S3. The job itself completes successfully without throwing an …
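For context, a minimal sketch of this kind of TSV-to-Parquet Glue job; the input and output paths and the tab separator are assumptions, not details from the post:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the TSV from S3 (placeholder path, assumed tab separator)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"separator": "\t", "withHeader": True},
)

# ... slight transformation of the data would go here ...

# Write the result back to S3 as Parquet (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()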

arguments error while calling an AWS Glue Pythonshell job from boto3

China☆狼群 submitted on 2019-12-13 03:19:23
Question: Based on the previous post, I have an AWS Glue Python shell job that needs to retrieve some information from the arguments passed to it through a boto3 call. My Glue job name is test_metrics. The Glue Python shell code looks like the following:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['test_metrics', 's3_target_path_key', 's3_target_path_value'])
print("Target path key is: ", args['s3_target_path_key'])
print("Target Path value is: ", args[ …
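For reference, a hedged sketch of how job arguments are typically passed from boto3 and then read back inside the job; the argument names mirror the post, while the values are hypothetical:

import boto3

glue = boto3.client("glue")

# Argument keys passed to start_job_run are prefixed with "--"
response = glue.start_job_run(
    JobName="test_metrics",
    Arguments={
        "--s3_target_path_key": "my/key",            # placeholder value
        "--s3_target_path_value": "s3://my-bucket",  # placeholder value
    },
)

# Inside the Glue job, the same names are resolved without the "--" prefix:
#   args = getResolvedOptions(sys.argv, ['s3_target_path_key', 's3_target_path_value'])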

AWS Glue to Redshift: duplicate data?

倾然丶 夕夏残阳落幕 submitted on 2019-12-13 02:42:36
Question: Here are some bullet points describing how I have things set up:
- I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema.
- I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift database using a JDBC connection. The job is also in charge of mapping the columns and creating the Redshift table.
By re-running the job, I am getting duplicate rows in Redshift (as expected). However, is there a way to replace or delete rows before …
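One commonly suggested approach for this situation (an assumption here, not part of the truncated excerpt) is to run a SQL statement before the insert via the preactions connection option when writing to Redshift, so a re-run replaces rows instead of appending. A sketch with placeholder connection, catalog, and table names:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Stand-in for the DynamicFrame produced by the job's mapping step (placeholder catalog names)
mapped_frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db", table_name="my_csv_table")

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped_frame,
    catalog_connection="my-redshift-connection",   # placeholder Glue connection name
    connection_options={
        "dbtable": "public.my_table",
        "database": "my_db",
        # Runs before the insert, clearing previously loaded rows
        "preactions": "DELETE FROM public.my_table;",
    },
    redshift_tmp_dir="s3://my-bucket/redshift-temp/",  # placeholder staging path
)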

AWS Glue pyspark UDF

我与影子孤独终老i submitted on 2019-12-12 19:05:40
Question: In AWS Glue, I need to convert a float value (Celsius to Fahrenheit) and am using a UDF. The following is my UDF:

toFahrenheit = udf(lambda x: '-1' if x in not_found else x * 9 / 5 + 32, StringType())

I am using the UDF on the Spark DataFrame as follows:

weather_df.withColumn("new_tmax", toFahrenheit(weather_df["tmax"])).drop("tmax").withColumnRenamed("new_tmax", "tmax")

When I run the code, I get the error message: IllegalArgumentException: u"requirement failed: The number of columns …
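For comparison, a minimal standalone sketch of a Celsius-to-Fahrenheit UDF that returns a numeric type instead of a string; the not_found sentinel set and the sample rows are made up for illustration and are not from the post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

not_found = {-9999.0}  # hypothetical sentinel values

# Returning a double keeps downstream arithmetic on the column working
toFahrenheit = udf(lambda x: -1.0 if x in not_found else x * 9 / 5 + 32, DoubleType())

weather_df = spark.createDataFrame([(20.0,), (-9999.0,)], ["tmax"])  # made-up sample rows
weather_df = (weather_df
              .withColumn("new_tmax", toFahrenheit(weather_df["tmax"]))
              .drop("tmax")
              .withColumnRenamed("new_tmax", "tmax"))
weather_df.show()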

Firehose JSON -> S3 Parquet -> ETL Spark, error: Unable to infer schema for Parquet

我与影子孤独终老i submitted on 2019-12-12 17:11:51
Question: It seems like this should be easy, like it's a core use case of this set of features, but it has been problem after problem. The latest is in trying to run commands via a Glue dev endpoint (both the PySpark and Scala endpoints), following the instructions here: https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-repl.html

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

glueContext = GlueContext …
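For context, a sketch of completing that setup on a dev endpoint and reading the Parquet output directly; the S3 path is a placeholder, not taken from the question:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# "Unable to infer schema for Parquet" is commonly raised when the prefix is
# empty or wrong, so point this at a prefix that actually contains .parquet files
df = spark.read.parquet("s3://my-firehose-bucket/parquet-output/")  # placeholder path
df.printSchema()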

AWS Glue: How to add a column with the source filename in the output?

点点圈 submitted on 2019-12-12 11:33:27
Question: Does anyone know of a way to add the source filename as a column in a Glue job? We created a flow where we crawled some files in S3 to create a schema. We then wrote a job that transforms the files to a new format and then writes those files back to another S3 bucket as CSV, to be used by the rest of our pipeline. What we would like to do is get access to some sort of job meta properties so we can add a new column to the output file that contains the original filename. I looked through the …
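One common technique for this (an assumption here, since the excerpt is truncated before any answer) is Spark's input_file_name() function, applied before any transform that loses the row-to-file association. A sketch with placeholder catalog names:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import input_file_name

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Placeholder database and table names
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Convert to a DataFrame and tag each row with the file it came from
df = dyf.toDF().withColumn("source_filename", input_file_name())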

How do I write messages to the output log on AWS Glue?

感情迁移 submitted on 2019-12-12 08:21:02
Question: AWS Glue jobs log output and errors to two different CloudWatch log groups, /aws-glue/jobs/error and /aws-glue/jobs/output, by default. When I include print() statements in my scripts for debugging, they get written to the error log (/aws-glue/jobs/error). I have tried using:

log4jLogger = sparkContext._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")

but "Hello World!" doesn't show up in either of the logs for the test job I ran. Does anyone know how …
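For completeness, a runnable version of that log4j attempt with the SparkContext set up first; where the messages actually land still depends on how logging is configured for the job:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Same log4j route as in the question, via the JVM gateway on the SparkContext
log4jLogger = sc._jvm.org.apache.log4j
log = log4jLogger.LogManager.getLogger(__name__)
log.warn("Hello World!")

Newer Glue versions also expose glueContext.get_logger() for continuous logging; whether that applies here depends on the job's configuration.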

AWS Glue Bookmark produces duplicates

心不动则不痛 submitted on 2019-12-11 18:33:16
Question: I am submitting a Python script (PySpark, actually) to a Glue job to process Parquet files and extract some analytics from this data source. These Parquet files live in an S3 folder and continuously grow with new data. I was happy with the job bookmark logic provided by AWS Glue because it helps a lot: it basically allows us to process only new data without reprocessing data that has already been handled. Unfortunately, in this scenario I notice that duplicates are produced each time, and it looks …
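For reference, a sketch of the pieces job bookmarks depend on: a transformation_ctx on the read and a job.commit() at the end. Paths and names below are placeholders, not details from the post:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Bookmarks track progress per transformation_ctx, so the read needs one
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/parquet-input/"]},  # placeholder path
    format="parquet",
    transformation_ctx="read_parquet_source",
)

# ... analytics and writes go here ...

# Without job.commit() the bookmark state is never saved, so every run reprocesses everything
job.commit()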

ImportError: No module named pg8000

亡梦爱人 submitted on 2019-12-11 18:26:28
Question: I am using AWS Glue and would like to connect it to AWS Aurora (Postgres). So I created a Glue job that connects to Aurora (Postgres) using pg8000, but I get the error from the title: ImportError: No module named pg8000. When creating the job, I set the Python library parameter to point to the library in S3. How can I solve this problem, and how can I connect AWS Glue to Aurora (Postgres)?

Answer 1: It looks like you can connect to Aurora Postgres only through an Amazon RDS …
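As the answer is truncated, here is only a hedged sketch of the JDBC route through Glue's built-in PostgreSQL support instead of pg8000; every connection option value below is a placeholder:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read from Aurora Postgres via Glue's JDBC connection support (placeholder values)
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://my-aurora-endpoint:5432/my_db",
        "dbtable": "public.my_table",
        "user": "my_user",
        "password": "my_password",
    },
)
dyf.printSchema()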

AWS Glue Error | Not able to read Glue tables from Developer End points using spark

杀马特。学长 韩版系。学妹 submitted on 2019-12-11 18:09:17
Question: I am not able to access AWS Glue tables even though I have granted all the required IAM permissions. I can't even list the databases. Here is the code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# New recommendation from AWS Support 2018-03-22
newconf = sc._conf.set("spark.sql.catalogImplementation", "in-memory")
sc.stop()
sc = sc.getOrCreate(newconf)
# …
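A runnable completion of that snippet which reads a catalog table through the GlueContext rather than Spark SQL; the database and table names are placeholders, not from the question:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read straight from the Data Catalog (placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db",
    table_name="my_table",
)
dyf.printSchema()

# Listing databases via Spark SQL only works if the session's catalog is backed by
# the Glue Data Catalog, e.g. the dev endpoint was created with the catalog enabled
spark.sql("show databases").show()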