amazon-s3

UPSERT in Parquet with PySpark

Submitted by 假如想象 on 2020-07-19 01:59:52
Question: I have parquet files in S3 with the following partition layout: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT the last 14 days: replace the existing data in S3 (one parquet file per partition) without deleting the days that are older than 14 days. I tried two save modes: append was not good because it just adds another file, and overwrite deletes the past data and the data for other partitions. Is there any way or best practice to …
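
One approach worth noting here (not mentioned in the question itself) is Spark's dynamic partition overwrite mode, which replaces only the partitions present in the DataFrame being written and leaves all other partitions in place. A minimal PySpark sketch, with the S3 paths as placeholders and the partition columns taken from the layout above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions that appear in the written DataFrame;
# partitions older than 14 days stay untouched in the target path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Assumed placeholder: a DataFrame holding just the last 14 days of data.
last_14_days_df = spark.read.parquet("s3://my-bucket/staging/last-14-days/")

(last_14_days_df
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "date", "some_id")
    .parquet("s3://my-bucket/target/"))

This setting is available in Spark 2.3 and later; on older versions, overwrite behaves as the question describes and removes the other partitions.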

CORS on canvas not working on Chrome and Safari

Submitted by …衆ロ難τιáo~ on 2020-07-18 17:00:12
Question: I'm trying to display images on a canvas. The images are hosted on S3, CORS is set up on AWS, and the crossOrigin attribute is set to anonymous in the script. Everything works fine on Firefox, but the images are not loaded on Chrome and Safari. Errors in Safari: Origin https://www.example.com is not allowed by Access-Control-Allow-Origin. http://mybucket.s3.amazonaws.com/bubble/foo.PNG [Error] Failed to load resource: Origin https://www.example.com is not allowed by Access-Control-Allow…
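
For reference, a minimal sketch of an S3 CORS rule applied with boto3 that would cover canvas image loads from the origin mentioned above; the asker's actual rule is not shown in the excerpt, so treat this as an illustrative configuration rather than the original setup:

import boto3

s3 = boto3.client('s3')

# Allow GET requests from the site's origin so canvas image loads
# receive an Access-Control-Allow-Origin response header.
cors_configuration = {
    'CORSRules': [{
        'AllowedOrigins': ['https://www.example.com'],
        'AllowedMethods': ['GET'],
        'AllowedHeaders': ['*'],
        'MaxAgeSeconds': 3000,
    }]
}

s3.put_bucket_cors(Bucket='mybucket', CORSConfiguration=cors_configuration)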

What is the difference between S3.Client.upload_file() and S3.Client.upload_fileobj()?

Submitted by 末鹿安然 on 2020-07-18 10:24:25
Question: According to the documentation for S3.Client.upload_file and S3.Client.upload_fileobj, upload_fileobj may sound faster. But does anyone know the specifics? Should I just upload the file, or should I open the file in binary mode to use upload_fileobj? In other words:

import boto3
s3 = boto3.resource('s3')

### Version 1
s3.meta.client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')

### Version 2
with open('/tmp/hello.txt', 'rb') as data:
    s3.meta.client.upload_fileobj(data, 'mybucket', 'hello.txt')

Is version 1 or version 2 …
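
As an illustrative aside (not part of the original question), the practical difference between the two calls is that upload_file takes a path on disk and opens the file itself, while upload_fileobj accepts any binary file-like object, including an in-memory buffer. A minimal sketch, with the bucket name as a placeholder:

import io
import boto3

s3_client = boto3.client('s3')

# upload_file: pass a filesystem path; boto3 opens and streams the file.
s3_client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')

# upload_fileobj: pass any binary file-like object, here an in-memory buffer,
# so no file on disk is required.
buffer = io.BytesIO(b'hello from memory')
s3_client.upload_fileobj(buffer, 'mybucket', 'hello-from-memory.txt')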

Is there a way to do a SQL dump from Amazon Redshift?

Submitted by 蓝咒 on 2020-07-17 06:26:08
Question: Is there a way to do a SQL dump from Amazon Redshift? Could you use the SQL Workbench/J client? Answer 1: We are currently using Workbench/J successfully with Redshift. Regarding dumps, at the time of writing there is no schema export tool available in Redshift (pg_dump doesn't work), although data can always be extracted via queries. Hope this helps. EDIT: Remember that things like sort and distribution keys are not reflected in the code generated by Workbench/J. Take a look at the system table pg_table_def to …
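
To follow up on the pg_table_def pointer, a minimal Python sketch that reads the column, type, and key information for one table; the connection parameters, credentials, and table name below are placeholders, not values from the original answer:

import psycopg2

# Placeholder connection details; substitute your cluster endpoint and credentials.
conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dev',
    user='awsuser',
    password='REPLACE_ME',
)

with conn.cursor() as cur:
    # pg_table_def lists each column's type plus distkey/sortkey flags,
    # which Workbench/J's generated code does not reflect.
    cur.execute(
        'SELECT "column", type, encoding, distkey, sortkey '
        'FROM pg_table_def '
        'WHERE schemaname = %s AND tablename = %s',
        ('public', 'my_table'),
    )
    for row in cur.fetchall():
        print(row)

conn.close()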

Copy multiple files from an S3 bucket

Submitted by 不羁的心 on 2020-07-16 22:09:58
Question: I am having trouble downloading multiple files from AWS S3 buckets to my local machine. I have all the filenames that I want to download and I do not want any others. How can I do that? Is there any kind of loop in the aws-cli so I can iterate? There are a couple of hundred files I need to download, so it does not seem possible to use one single command that takes all filenames as arguments. Answer 1: There is a bash script which can read all the filenames from a file filename.txt. #!/bin/bash set …
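
The bash answer is cut off above; as an alternative sketch of the same idea in Python with boto3, assuming the S3 keys are listed one per line in filename.txt and the bucket name is a placeholder:

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'  # placeholder bucket name

# Read one S3 key per line, skipping blank lines.
with open('filename.txt') as f:
    keys = [line.strip() for line in f if line.strip()]

# Download each listed object into the current directory.
for key in keys:
    local_name = key.split('/')[-1]
    print(f'downloading s3://{bucket}/{key} -> {local_name}')
    s3.download_file(bucket, key, local_name)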

HTML video loop re-downloads video file

Submitted by 故事扮演 on 2020-07-15 04:41:01
Question: I have an HTML5 video that is rather large, and I'm using Chrome. The video element has the loop attribute, but each time the video "loops", the browser re-downloads the video file. I have set Cache-Control "max-age=15768000, private". However, this does not prevent any extra downloads of the identical file. I am using Amazon S3 to host the file. Also, the S3 server responds with the Accept-Ranges header, which causes the several hundred partial downloads of the file to be requested with the …
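
As a side note on the caching header itself: Cache-Control on an object already in S3 can be changed by copying the object onto itself with replaced metadata. A minimal boto3 sketch with placeholder bucket and key names; this only controls the response header and does not by itself change how Chrome handles looping video:

import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'videos/large-video.mp4'  # placeholders

# Copy the object onto itself, replacing its metadata so the new
# Cache-Control value is stored with the object.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={'Bucket': bucket, 'Key': key},
    MetadataDirective='REPLACE',
    CacheControl='max-age=15768000, private',
    ContentType='video/mp4',
)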

AWS Glue Crawler cannot parse large files (classification UNKNOWN)

Submitted by ε祈祈猫儿з on 2020-07-10 10:27:29
Question: I've been working on using the AWS Glue crawler to obtain the columns and other features of a certain JSON file. I parsed the JSON file locally, converted it to UTF-8, and used boto3 to move it into an S3 bucket that the crawler reads from. I created a JSON classifier with the custom JSON path $[*] and created a crawler with the default settings. When I do this with a file that is relatively small (<50 KB), the crawler correctly identifies the columns …
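
For context, a minimal boto3 sketch of the classifier and crawler setup the question describes; the names, IAM role, database, and S3 path are placeholders, not the asker's actual values:

import boto3

glue = boto3.client('glue')

# Custom JSON classifier using the $[*] JSON path, as in the question.
glue.create_classifier(
    JsonClassifier={
        'Name': 'json-array-classifier',
        'JsonPath': '$[*]',
    }
)

# Crawler that applies the classifier to the S3 prefix holding the file.
glue.create_crawler(
    Name='json-crawler',
    Role='AWSGlueServiceRole-demo',   # placeholder IAM role
    DatabaseName='demo_db',           # placeholder Glue database
    Targets={'S3Targets': [{'Path': 's3://my-bucket/json-data/'}]},
    Classifiers=['json-array-classifier'],
)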

QuickSight dataset unable to import data from the data source, and ingestion problems when the dataset is created using the CreateDataSet API

Submitted by 一世执手 on 2020-07-10 08:31:04
Question:

ArrayList<ResourcePermission> permissions = new ArrayList<>();
ArrayList<String> action = new ArrayList<>();
action.add("quicksight:UpdateDataSourcePermissions");
action.add("quicksight:DescribeDataSource");
action.add("quicksight:DescribeDataSourcePermissions");
action.add("quicksight:PassDataSource");
action.add("quicksight:UpdateDataSource");
action.add("quicksight:DeleteDataSource");
permissions.add(new ResourcePermission().withPrincipal(PrincipalArn).withActions(action));
return (getClient() …
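
Since the question mentions an ingestion problem for a dataset created with the CreateDataSet API, one way to see the underlying error is to query the ingestion status. A minimal boto3 sketch in Python (the QuickSight API also exposes the equivalent DescribeIngestion operation in the Java SDK), with the account ID, dataset ID, and ingestion ID as placeholders:

import boto3

qs = boto3.client('quicksight')

# Placeholders: substitute your AWS account ID, dataset ID and ingestion ID.
response = qs.describe_ingestion(
    AwsAccountId='123456789012',
    DataSetId='my-dataset-id',
    IngestionId='my-ingestion-id',
)

ingestion = response['Ingestion']
print(ingestion['IngestionStatus'])
# On a failed ingestion, ErrorInfo holds the error type and message.
print(ingestion.get('ErrorInfo'))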