aws-glue

AWS Glue crawler could not classify the type of a file stored in S3 if its size is >1MB

Asked on 2019-12-24 09:12:54
Question: When I try to detect the file type of an input JSON file >=1MB using a Crawler, it creates a table in Glue with its classification type set to "Unknown". But when the size is <1MB it successfully classifies the file type as JSON. I cross-checked the file to ensure it is valid JSON. Is this a limitation of the AWS crawler? If so, is there any alternative to this issue? 回答1: Yes, that is by design of the crawler; if the metadata (the crawler creates it internally) exceeds 1 MB you'll get
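When the crawler cannot classify a file, one alternative (a minimal sketch, assuming you already know the schema; the database, table, bucket, and column names below are placeholders, not from the original question) is to register the table in the Data Catalog yourself with boto3 instead of relying on the crawler's classification:

```python
import boto3

glue = boto3.client("glue")

# Register a JSON table manually so the crawler's classification is not needed.
glue.create_table(
    DatabaseName="my_database",            # placeholder
    TableInput={
        "Name": "large_json_table",        # placeholder
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},
                {"Name": "payload", "Type": "string"},
            ],
            "Location": "s3://my-bucket/json-data/",  # placeholder
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
    },
)
```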

AWS Glue Crawler Cannot Extract CSV Headers

Asked on 2019-12-24 03:26:05
Question: At my wits' end here... I have 15 CSV files that I am generating from a Beeline query like: beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv I chose dsv because some string fields include commas and they are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does). Anyway, I upload these 15 CSV files to an S3 bucket and run my crawler. Everything works great. For 14 of them. Glue is
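When the built-in classifier misses the header row, one commonly used option (a sketch only; the classifier name, delimiter, and header columns below are assumptions, adjust them to your files) is to attach a custom CSV classifier that declares the header explicitly:

```python
import boto3

glue = boto3.client("glue")

# Custom CSV classifier that states the delimiter and header outright,
# so the crawler does not have to guess whether the first row is a header.
glue.create_classifier(
    CsvClassifier={
        "Name": "pipe-delimited-with-header",   # hypothetical name
        "Delimiter": "|",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",
        "Header": ["col_a", "col_b", "col_c"],  # placeholder column names
        "DisableValueTrimming": False,
        "AllowSingleColumn": False,
    }
)
```

The classifier then has to be added to the crawler's Classifiers list (for example via update_crawler) and the crawler re-run.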

Could we use AWS Glue just to copy a file from one S3 folder to another S3 folder?

Asked on 2019-12-22 00:40:26
Question: I need to copy a zipped file from one AWS S3 folder to another and would like to make that a scheduled AWS Glue job. I cannot find an example for such a simple task. Please help if you know the answer. Maybe the answer is in AWS Lambda, or other AWS tools. Thank you very much! 回答1: You can do this, and there may be a reason to use AWS Glue: if you have chained Glue jobs and glue_job_#2 is triggered on the successful completion of glue_job_#1. The simple Python script below moves a file from
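The original script is cut off here; a minimal sketch of what such a copy/move step could look like (the bucket names and keys are placeholders, not the answerer's values) is:

```python
import boto3

s3 = boto3.resource("s3")

# Placeholder bucket/key names -- replace with your own paths.
source = {"Bucket": "source-bucket", "Key": "incoming/archive.zip"}

# Copy the object to the target prefix, then delete the original
# if you want "move" rather than "copy" semantics.
s3.Object("target-bucket", "processed/archive.zip").copy(source)
s3.Object("source-bucket", "incoming/archive.zip").delete()
```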

AWS Glue pushdown predicate not working properly

Asked on 2019-12-21 06:26:41
Question: I'm trying to optimize my Glue/PySpark job by using push-down predicates. start = date(2019, 2, 13) end = date(2019, 2, 27) print(">>> Generate data frame for ", start, " to ", end, "... ") relaventDatesDf = spark.createDataFrame([ Row(start=start, stop=end) ]) relaventDatesDf.createOrReplaceTempView("relaventDates") relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates") relaventDatesDf.createOrReplaceTempView("relaventDates")
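For context, a push-down predicate in Glue is passed when reading from the Data Catalog; a minimal sketch (the database, table, and partition column names are assumptions, not the poster's job) looks like:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The predicate is evaluated against partition columns only, so partitions
# outside the date range are never listed or read from S3.
flights = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="flights",     # placeholder
    push_down_predicate="querydatetime BETWEEN '2019-02-13' AND '2019-02-27'",
)
```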

Using AWS Glue to convert very big csv.gz files (30-40 GB each) to Parquet

Asked on 2019-12-20 04:15:10
Question: There are lots of such questions, but nothing seems to help. I am trying to convert quite large csv.gz files to Parquet and keep getting various errors like 'Command failed with exit code 1' or An error occurred while calling o392.pyWriteDynamicFrame. Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-31-5-241.eu-central-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of
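A pattern often suggested for this situation (a sketch, not the poster's job; the paths and partition count are placeholders) is to repartition right after reading, since a .gz file is not splittable and otherwise a single executor carries the whole file through the write:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the gzipped CSV (one task per .gz file, because gzip is not splittable),
# then spread the rows across many partitions before the expensive write.
df = spark.read.csv("s3://my-bucket/input/*.csv.gz", header=True)  # placeholder path
df = df.repartition(200)  # tune to your data volume / DPU count

df.write.mode("overwrite").parquet("s3://my-bucket/output/")        # placeholder path
```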

AWS Athena concurrency limits: number of submitted queries vs. number of running queries

Asked on 2019-12-19 03:56:27
Question: According to the AWS Athena limitations you can submit up to 20 queries of the same type at a time, but this is a soft limit and can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In the AWS account, I am the only one using the Athena service. However, when I look at the state of the queries in the console, I see that only a few queries (5 on average) are actually being executed despite all of them being
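For reference, this is roughly what such a submission loop looks like with boto3 (a sketch; the database, SQL, and output location are placeholders, not the poster's script):

```python
import boto3

athena = boto3.client("athena")

# Submit several CTAS queries without waiting for each to finish; Athena
# queues them and runs as many concurrently as the account limits allow.
query_ids = []
for i in range(16):
    response = athena.start_query_execution(
        QueryString=f"CREATE TABLE my_db.part_{i} AS "
                    f"SELECT * FROM my_db.source WHERE bucket_id = {i}",   # placeholder SQL
        QueryExecutionContext={"Database": "my_db"},                       # placeholder database
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
    )
    query_ids.append(response["QueryExecutionId"])

# Poll the state to see how many are QUEUED vs RUNNING at any moment.
states = [
    athena.get_query_execution(QueryExecutionId=q)["QueryExecution"]["Status"]["State"]
    for q in query_ids
]
print(states)
```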

AWS Glue executor memory limit

Asked on 2019-12-18 15:36:17
Question: I found that AWS Glue sets up its executor instances with a memory limit of 5 GB ( --conf spark.executor.memory=5g ), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError . The same goes for the driver instance ( spark.driver.memory=5g ). Is there any option to increase this value? 回答1: The official Glue documentation suggests that Glue doesn't support custom Spark config. There are also several argument names used by AWS Glue internally that you should never set: --conf — Internal to AWS
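Despite that warning, a workaround some users have reported (unofficial, and possibly unsupported; the job name and overhead value below are placeholders) is to pass a --conf job parameter anyway, for example to raise the YARN memory overhead:

```python
import boto3

glue = boto3.client("glue")

# Unofficial workaround reported by some users: override --conf at run time,
# even though the Glue docs list --conf as an internal argument.
glue.start_job_run(
    JobName="my-etl-job",   # placeholder job name
    Arguments={"--conf": "spark.yarn.executor.memoryOverhead=1024"},
)
```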

Use AWS Glue Python with NumPy and Pandas Python Packages

Asked on 2019-12-17 16:28:08
Question: What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed Python script that I would like to run in AWS Glue and that uses NumPy and Pandas. 回答1: I think the current answer is you cannot. According to the AWS Glue documentation: Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported. But even when I try to include a normal Python
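For pure-Python dependencies, the usual route at the time was the --extra-py-files job argument, pointing at a package uploaded to S3; a sketch (the job name and S3 path are placeholders) is:

```python
import boto3

glue = boto3.client("glue")

# Pure-Python packages can be zipped, uploaded to S3, and referenced with
# --extra-py-files. Packages with C extensions such as NumPy/pandas were
# not supported in Glue's original Spark environment, per the docs quoted above.
glue.start_job_run(
    JobName="my-etl-job",  # placeholder
    Arguments={
        "--extra-py-files": "s3://my-bucket/libs/my_pure_python_lib.zip",  # placeholder
    },
)
```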

Spark Catalog w/ AWS Glue: database not found

Asked on 2019-12-13 16:43:27
Question: I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via spark.catalog.setCurrentDatabase("test") spark.catalog.listTables However, when I submit a job via spark-submit I get a fatal error ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.; I am creating my SparkSession within the job being submitted via spark-submit via
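A minimal PySpark sketch of the session setup this usually comes down to (assuming the problem is the self-built session rather than the cluster; the app name is a placeholder): when you build the SparkSession yourself instead of using the EMR-provided shell, Hive support and the Glue catalog factory have to be configured explicitly, or the job falls back to a local catalog where 'test' does not exist.

```python
from pyspark.sql import SparkSession

# Build the session with Hive support and the Glue Data Catalog as the metastore.
spark = (
    SparkSession.builder
    .appName("glue-catalog-job")  # placeholder app name
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

spark.catalog.setCurrentDatabase("test")
print(spark.catalog.listTables())
```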

[Amazon](500150) Error setting/closing connection: Connection timed out

Asked on 2019-12-13 03:56:45
Question: I am having a connectivity issue from the Glue console while trying to connect to a Redshift cluster. I am able to connect to the Redshift cluster with the exact same credentials from my desktop. I have followed the AWS documentation and have "ALL TCP" connections open for the security group in which the Redshift cluster resides. Both Glue and Redshift are in the same Region. Glue has also been given AWSRedshiftFullAccess. I am running into a wall and would appreciate any guidance to resolve this issue. I followed the
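One requirement that is easy to miss in this setup (a sketch, assuming the problem is the Glue-side networking; the security group ID is a placeholder) is the self-referencing inbound rule that Glue JDBC connections need on the data store's security group, in addition to any rules that allow your desktop's IP:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the security group to accept traffic from itself (self-referencing rule),
# which Glue requires for JDBC connections inside the VPC.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
        }
    ],
)
```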