aws-glue

AWS Glue Crawler adding tables for every partition?

Submitted by 我是研究僧i on 2020-01-31 08:28:16
Question: I have several thousand files in an S3 bucket in this form:

    ├── bucket
    │   ├── somedata
    │   │   ├── year=2016
    │   │   ├── year=2017
    │   │   │   ├── month=11
    │   │   │   │   ├── sometype-2017-11-01.parquet
    │   │   │   │   ├── sometype-2017-11-02.parquet
    │   │   │   │   ├── ...
    │   │   │   ├── month=12
    │   │   │   │   ├── sometype-2017-12-01.parquet
    │   │   │   │   ├── sometype-2017-12-02.parquet
    │   │   │   │   ├── ...
    │   │   ├── year=2018
    │   │   │   ├── month=01
    │   │   │   │   ├── sometype-2018-01-01.parquet
    │   │   │   │   ├── sometype-2018-01-02.parquet
    │   │   │   │   ├── ...
    │   ├──
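The excerpt cuts off before any answer, but a common remedy for one-table-per-partition results is to point the crawler at the dataset root (not the bucket root) and enable schema grouping. A minimal boto3 sketch under those assumptions; the crawler name, role ARN, and database below are hypothetical placeholders:

    import json
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="somedata-crawler",                       # hypothetical name
        Role="arn:aws:iam::123456789012:role/MyGlueRole",  # placeholder role
        DatabaseName="mydb",                           # placeholder database
        # Target the dataset root so year=/month= directories are read
        # as partitions of a single table rather than separate tables.
        Targets={"S3Targets": [{"Path": "s3://bucket/somedata/"}]},
        # Combine compatible schemas into one table instead of emitting
        # a table per partition directory.
        Configuration=json.dumps({
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }),
    )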

Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array

Submitted by 大憨熊 on 2020-01-28 10:23:49
Question: I am working on updating a MySQL database using the PySpark framework, running on AWS Glue. I have a DataFrame as follows:

    df2 = sqlContext.createDataFrame(
        [("xxx1", "81A01", "TERR NAME 55", "NY"),
         ("xxx2", "81A01", "TERR NAME 55", "NY"),
         ("x103", "81A01", "TERR NAME 01", "NJ")],
        ["zip_code", "territory_code", "territory_name", "state"])

    # Print out information about this data
    df2.show()
    +--------+--------------+--------------+-----+
    |zip_code|territory_code|territory_name|state|
    +--------+--------
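The excerpt ends before the error, but Spark's JDBC writer has no ON DUPLICATE KEY UPDATE mode, so one common workaround is to run the upsert per partition with a MySQL client such as pymysql. A sketch, assuming a hypothetical RDS endpoint and a territories table keyed on zip_code:

    import pymysql

    def upsert_partition(rows):
        # Hypothetical connection settings; replace with your RDS endpoint.
        conn = pymysql.connect(host="mydb.example.com", user="admin",
                               password="secret", database="sales")
        sql = ("INSERT INTO territories "
               "(zip_code, territory_code, territory_name, state) "
               "VALUES (%s, %s, %s, %s) "
               "ON DUPLICATE KEY UPDATE "
               "territory_name = VALUES(territory_name), "
               "state = VALUES(state)")
        with conn.cursor() as cur:
            # Each partition arrives as an iterator of Rows.
            cur.executemany(sql, [tuple(r) for r in rows])
        conn.commit()
        conn.close()

    df2.foreachPartition(upsert_partition)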

AWS Glue DPU configurations

Submitted by 被刻印的时光 ゝ on 2020-01-25 07:23:26
Question: I see that a DPU is made up of 4 vCPUs and 16 GB of memory. Is it possible to change these settings for vCPU and memory, so that I don't run out of DPUs or exceed the DPU limit? I think there is a maximum limit of 5 DPUs for a Dev Endpoint, and a maximum of 2 Dev Endpoints for an account? Regards, Yuva

Answer 1: Right now there is no way to configure DPU memory, but you can request a limit increase on your account to be able to use more DPUs.

Answer 2: As of April 2019, there are two new types of workers: You can
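Answer 2 is truncated, but the two worker types it presumably refers to are G.1X (1 DPU per worker) and G.2X (2 DPUs per worker), which let you size capacity per worker instead of per raw DPU. A hedged boto3 sketch; the job name, role, and script path are placeholders:

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="my-etl-job",                                  # hypothetical
        Role="MyGlueRole",                                  # placeholder role
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/job.py"},
        # G.1X = 4 vCPU / 16 GB per worker; G.2X = 8 vCPU / 32 GB per worker.
        WorkerType="G.2X",
        NumberOfWorkers=10,
    )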

AWS Glue crawler change serde

Submitted by 核能气质少年 on 2020-01-24 17:30:06
Question: I have CSVs with quoted strings, and the crawler by default registers the table with LazySimpleSerDe. Is there any way I can programmatically change this to use the OpenCSVSerde instead?

Answer 1: You can make use of boto3, which is the AWS SDK. You can simply call its API using Python or any other language. To precisely answer your question: you will need to call the update_table() API to update the SerDe used by the Glue table. https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client
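A minimal sketch of the update_table() approach the answer points to, assuming hypothetical database and table names. The usual pattern is get_table(), mutate the SerDe info, strip the read-only fields, and write it back:

    import boto3

    glue = boto3.client("glue")

    table = glue.get_table(DatabaseName="mydb", Name="my_csv_table")["Table"]

    # Swap the SerDe registered by the crawler for OpenCSVSerde.
    table["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {"separatorChar": ",", "quoteChar": '"'},
    }

    # update_table rejects read-only fields that get_table returns.
    for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                "IsRegisteredWithLakeFormation", "CatalogId"):
        table.pop(key, None)

    glue.update_table(DatabaseName="mydb", TableInput=table)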

Moving data from S3 -> RDS using AWS Glue

Submitted by ↘锁芯ラ on 2020-01-23 17:13:06
Question: Does AWS Glue provide the ability to move data from an S3 bucket to an RDS database? I'm trying to set up a serverless app that picks up dynamic data uploaded to S3 and migrates it to RDS. Glue provides a Crawlers service that determines the schema. Glue also provides ETL Jobs, but the target there seems to only be another S3 bucket. Any ideas?

Answer 1: Yes, Glue can send to an RDS datastore. If you are using the job wizard, it will give you a target option of "JDBC". If you select JDBC you can set up a
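The answer is cut off, but the JDBC route it describes typically looks like the sketch below inside a Glue job script. "my-rds-connection" stands for a Glue Connection you would create pointing at the RDS instance; database and table names are placeholders:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the S3-backed table the crawler registered in the catalog.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="mydb", table_name="s3_source_table")

    # Write it to RDS through the Glue Connection's JDBC settings.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-rds-connection",
        connection_options={"dbtable": "target_table", "database": "sales"})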

AWS Glue pricing against AWS EMR

Submitted by 妖精的绣舞 on 2020-01-21 03:20:29
Question: I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR and Glue. I have considered 6 DPUs (4 vCPUs + 16 GB memory) with the ETL job running for 10 minutes for 30 days. Expected crawler requests are assumed to be 1 million above the free tier, calculated at $1 for the 1 million additional requests. On EMR I have considered m3.xlarge for both EC2 & EMR (priced at $0.266 and $0.070 per hour respectively) with 6 nodes, running for 10 minutes for 30 days. On calculating
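The excerpt ends before the numbers, but the arithmetic under the stated inputs is straightforward. A back-of-the-envelope sketch, assuming one 10-minute run per day, $0.44 per Glue DPU-hour (the us-east-1 rate at the time, not stated in the excerpt), and per-second EMR billing:

    # Glue: 6 DPUs x (10/60) h x 30 days = 30 DPU-hours.
    glue_dpu_hours = 6 * (10 / 60) * 30
    glue_cost = glue_dpu_hours * 0.44 + 1.00   # + $1 crawler requests
    print(glue_cost)                           # -> 14.20

    # EMR: (EC2 + EMR surcharge) per m3.xlarge hour, 6 nodes.
    emr_rate = 0.266 + 0.070                   # $0.336/hour per node
    emr_cost = 6 * emr_rate * (10 / 60) * 30
    print(emr_cost)                            # -> 10.08

Note that Glue's 10-minute billing minimum (at the time) does not change the Glue figure here, since each run is exactly 10 minutes.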

display DataFrame when using pyspark aws glue

Submitted by 自古美人都是妖i on 2020-01-16 19:34:29
Question: How can I show a DataFrame in an AWS Glue ETL job? I tried the code below, but it doesn't display anything.

    df.show()

Code:

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database = "flux-test", table_name = "tab1",
        transformation_ctx = "datasource0")
    sourcedf = ApplyMapping.apply(
        frame = datasource0,
        mappings = [("id", "long", "id", "long"),
                    ("Rd.Id_Releve", "string", "Rd.Id_R", "string")])
    sourcedf = sourcedf.toDF()
    data = []
    schema = StructType([
        StructField('PM', StructType([
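No answer survives in the excerpt, but two things usually explain this: a DynamicFrame must be converted to a Spark DataFrame before show(), and in a Glue job the driver's stdout lands in CloudWatch Logs rather than the console you launched from. A minimal sketch of that assumption:

    # Convert the DynamicFrame, then show; truncate=False keeps long values.
    df = datasource0.toDF()
    df.show(10, truncate=False)

    # In a Glue job run, this output appears in the CloudWatch log group
    # /aws-glue/jobs/output for the run, not in the Glue console itself.
    # A Dev Endpoint / notebook shows it inline as usual.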

AWS Glue Customized Crawler

Submitted by 帅比萌擦擦* on 2020-01-16 12:02:59
Question: I've created an AWS Glue crawler to gather information on my Redshift database. Is there a way I can customize this crawler to update the "comment" field in Glue with a field that all my tables have? This field would be the comment or description field that all Redshift tables have. Any help would be appreciated. Thanks.

Source: https://stackoverflow.com/questions/59200724/aws-glue-customized-crawler
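This question has no answer in the excerpt. One possible workaround, sketched here as an assumption rather than a confirmed crawler feature, is a post-crawl step that copies the Redshift comments into each Glue table's Description via the same update_table() pattern shown earlier. All names and the comment mapping below are hypothetical:

    import boto3

    glue = boto3.client("glue")

    # Comments fetched separately from Redshift (e.g. pg_description);
    # shown here as a placeholder mapping of table name -> comment.
    redshift_comments = {"sales": "Daily sales facts"}

    for name, comment in redshift_comments.items():
        table = glue.get_table(DatabaseName="redshift_db", Name=name)["Table"]
        table["Description"] = comment
        for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                    "IsRegisteredWithLakeFormation", "CatalogId"):
            table.pop(key, None)  # strip read-only fields before writing back
        glue.update_table(DatabaseName="redshift_db", TableInput=table)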