aws-glue

AWS Glue Crawler adding tables for every partition?

Submitted by 我是研究僧i on 2020-01-31 08:28:16
Question: I have several thousand files in an S3 bucket in this form:

    ├── bucket
    │   ├── somedata
    │   │   ├── year=2016
    │   │   ├── year=2017
    │   │   │   ├── month=11
    │   │   │   │   ├── sometype-2017-11-01.parquet
    │   │   │   │   ├── sometype-2017-11-02.parquet
    │   │   │   │   ├── ...
    │   │   │   ├── month=12
    │   │   │   │   ├── sometype-2017-12-01.parquet
    │   │   │   │   ├── sometype-2017-12-02.parquet
    │   │   │   │   ├── ...
    │   │   ├── year=2018
    │   │   │   ├── month=01
    │   │   │   │   ├── sometype-2018-01-01.parquet
    │   │   │   │   ├── sometype-2018-01-02.parquet
    │   │   │   │   ├── ...
    │   ├──
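The excerpt cuts off before any answer, but a common remedy for one-table-per-partition results is to point the crawler at the dataset root (not the bucket root) and enable schema grouping. A minimal boto3 sketch under those assumptions; the crawler name, role ARN, and database below are hypothetical placeholders:

    import json
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="somedata-crawler",                       # hypothetical name
        Role="arn:aws:iam::123456789012:role/MyGlueRole",  # placeholder role
        DatabaseName="mydb",                           # placeholder database
        # Target the dataset root so year=/month= directories are read
        # as partitions of a single table rather than separate tables.
        Targets={"S3Targets": [{"Path": "s3://bucket/somedata/"}]},
        # Combine compatible schemas into one table instead of emitting
        # a table per partition directory.
        Configuration=json.dumps({
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }),
    )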

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue. I have a DataFrame as follows:

    df2 = sqlContext.createDataFrame(
        [("xxx1", "81A01", "TERR NAME 55", "NY"),
         ("xxx2", "81A01", "TERR NAME 55", "NY"),
         ("x103", "81A01", "TERR NAME 01", "NJ")],
        ["zip_code", "territory_code", "territory_name", "state"])

    # Print out information about this data
    df2.show()
    +--------+--------------+--------------+-----+
    |zip_code|territory_code|territory_name|state|
    +--------+--------
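The excerpt ends before the error, but Spark's JDBC writer has no ON DUPLICATE KEY UPDATE mode, so one common workaround is to run the upsert per partition with a MySQL client such as pymysql. A sketch, assuming a hypothetical RDS endpoint and a territories table keyed on zip_code:

    import pymysql

    def upsert_partition(rows):
        # Hypothetical connection settings; replace with your RDS endpoint.
        conn = pymysql.connect(host="mydb.example.com", user="admin",
                               password="secret", database="sales")
        sql = ("INSERT INTO territories "
               "(zip_code, territory_code, territory_name, state) "
               "VALUES (%s, %s, %s, %s) "
               "ON DUPLICATE KEY UPDATE "
               "territory_name = VALUES(territory_name), "
               "state = VALUES(state)")
        with conn.cursor() as cur:
            # Each partition arrives as an iterator of Rows.
            cur.executemany(sql, [tuple(r) for r in rows])
        conn.commit()
        conn.close()

    df2.foreachPartition(upsert_partition)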

AWS Glue DPU configurations

Submitted by 被刻印的时光 ゝ on 2020-01-25 07:23:26
Question: I see that a DPU is made up of 4 vCPUs and 16 GB of memory. Is it possible to change these settings for vCPU and memory, so that I don't run out of DPUs or exceed the DPU limit? I think there is a maximum limit of 5 DPUs for a Dev Endpoint, and a maximum of 2 Dev Endpoints for an account? Regards, Yuva

Answer 1: Right now there is no way to configure DPU memory, but you can request a limit increase on your account to be able to use more DPUs.

Answer 2: As of April 2019, there are two new types of workers: You can
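Answer 2 is truncated, but the two worker types it presumably refers to are G.1X (1 DPU per worker) and G.2X (2 DPUs per worker), which let you size capacity per worker instead of per raw DPU. A hedged boto3 sketch; the job name, role, and script path are placeholders:

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="my-etl-job",                                  # hypothetical
        Role="MyGlueRole",                                  # placeholder role
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/job.py"},
        # G.1X = 4 vCPU / 16 GB per worker; G.2X = 8 vCPU / 32 GB per worker.
        WorkerType="G.2X",
        NumberOfWorkers=10,
    )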

AWS Glue crawler change serde

Submitted by 核能气质少年 on 2020-01-24 17:30:06
Question: I have CSVs with quoted strings, and the crawler by default registers the table with LazySimpleSerDe. Is there any way I can programmatically change this to use the OpenCSVSerde instead?

Answer 1: You can make use of boto3, which is the AWS SDK. You can simply call its API using Python or any other language. To precisely answer your question: you will need to call the update_table() API to update the SerDe used by the Glue table. https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client
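A minimal sketch of the update_table() approach the answer points to, assuming hypothetical database and table names. The usual pattern is get_table(), mutate the SerDe info, strip the read-only fields, and write it back:

    import boto3

    glue = boto3.client("glue")

    table = glue.get_table(DatabaseName="mydb", Name="my_csv_table")["Table"]

    # Swap the SerDe registered by the crawler for OpenCSVSerde.
    table["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {"separatorChar": ",", "quoteChar": '"'},
    }

    # update_table rejects read-only fields that get_table returns.
    for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                "IsRegisteredWithLakeFormation", "CatalogId"):
        table.pop(key, None)

    glue.update_table(DatabaseName="mydb", TableInput=table)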

Moving data from S3 -> RDS using AWS Glue

Submitted by ↘锁芯ラ on 2020-01-23 17:13:06
Question: Does AWS Glue provide the ability to move data from an S3 bucket to an RDS database? I'm trying to set up a serverless app that picks up dynamic data uploaded to S3 and migrates it to RDS. Glue provides a Crawlers service that determines the schema. Glue also provides ETL Jobs, but the target there seems to only be another S3 bucket. Any ideas?

Answer 1: Yes, Glue can send to an RDS datastore. If you are using the job wizard, it will give you a target option of "JDBC". If you select JDBC you can set up a
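The answer is cut off, but the JDBC route it describes typically looks like the sketch below inside a Glue job script. "my-rds-connection" stands for a Glue Connection you would create pointing at the RDS instance; database and table names are placeholders:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the S3-backed table the crawler registered in the catalog.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="mydb", table_name="s3_source_table")

    # Write it to RDS through the Glue Connection's JDBC settings.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-rds-connection",
        connection_options={"dbtable": "target_table", "database": "sales"})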

AWS Glue pricing against AWS EMR

Submitted by 妖精的绣舞 on 2020-01-21 03:20:29
Question: I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR and Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with the ETL job running for 10 minutes for 30 days. Expected crawler requests are assumed to be 1 million above the free tier, calculated at $1 for the 1 million additional requests. On EMR I have considered m3.xlarge for both EC2 & EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes for 30 days. On calculating
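The excerpt ends before the numbers, but the arithmetic under the stated inputs is straightforward. A back-of-the-envelope sketch, assuming one 10-minute run per day, $0.44 per Glue DPU-hour (the us-east-1 rate at the time, not stated in the excerpt), and per-second EMR billing:

    # Glue: 6 DPUs x (10/60) h x 30 days = 30 DPU-hours.
    glue_dpu_hours = 6 * (10 / 60) * 30
    glue_cost = glue_dpu_hours * 0.44 + 1.00   # + $1 crawler requests
    print(glue_cost)                           # -> 14.20

    # EMR: (EC2 + EMR surcharge) per m3.xlarge hour, 6 nodes.
    emr_rate = 0.266 + 0.070                   # $0.336/hour per node
    emr_cost = 6 * emr_rate * (10 / 60) * 30
    print(emr_cost)                            # -> 10.08

Note that Glue's 10-minute billing minimum (at the time) does not change the Glue figure here, since each run is exactly 10 minutes.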

display DataFrame when using pyspark aws glue

Submitted by 自古美人都是妖i on 2020-01-16 19:34:29
Question: How can I show a DataFrame in an AWS Glue ETL job? I tried the code below, but it doesn't display anything.

    df.show()

Code:

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database = "flux-test", table_name = "tab1",
        transformation_ctx = "datasource0")
    sourcedf = ApplyMapping.apply(
        frame = datasource0,
        mappings = [("id", "long", "id", "long"),
                    ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
    sourcedf = sourcedf.toDF()
    data = []
    schema = StructType([
        StructField('PM', StructType([
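No answer survives in the excerpt, but two things usually explain this: a DynamicFrame must be converted to a Spark DataFrame before show(), and in a Glue job the driver's stdout lands in CloudWatch Logs rather than the console you launched from. A minimal sketch of that assumption:

    # Convert the DynamicFrame, then show; truncate=False keeps long values.
    df = datasource0.toDF()
    df.show(10, truncate=False)

    # In a Glue job run, this output appears in the CloudWatch log group
    # /aws-glue/jobs/output for the run, not in the Glue console itself.
    # A Dev Endpoint / notebook shows it inline as usual.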

AWS Glue Customized Crawler

Submitted by 帅比萌擦擦* on 2020-01-16 12:02:59
Question: I've created an AWS Glue crawler to gather information on my Redshift database. Is there a way I can customize this crawler to update the "comment" field in Glue with a field that all my tables have? This field would be the comment or description field that all Redshift tables have. Any help would be appreciated. Thanks.

Source: https://stackoverflow.com/questions/59200724/aws-glue-customized-crawler
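This question has no answer in the excerpt. One possible workaround, sketched here as an assumption rather than a confirmed crawler feature, is a post-crawl step that copies the Redshift comments into each Glue table's Description via the same update_table() pattern shown earlier. All names and the comment mapping below are hypothetical:

    import boto3

    glue = boto3.client("glue")

    # Comments fetched separately from Redshift (e.g. pg_description);
    # shown here as a placeholder mapping of table name -> comment.
    redshift_comments = {"sales": "Daily sales facts"}

    for name, comment in redshift_comments.items():
        table = glue.get_table(DatabaseName="redshift_db", Name=name)["Table"]
        table["Description"] = comment
        for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                    "IsRegisteredWithLakeFormation", "CatalogId"):
            table.pop(key, None)  # strip read-only fields before writing back
        glue.update_table(DatabaseName="redshift_db", TableInput=table)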