amazon-athena

Extract substrings from field sql/presto

Submitted by 左心房为你撑大大i on 2020-01-05 03:46:10
Question: I have columns in my database that contain values separated by "/". I am trying to extract certain values from these columns and create a new row with them. Example data looks like this:

user/values2/class/year/subject/18/9/2000291.csv
holiday/booking/type/1092/1921/1.csv
drink/water/juice/1/232/89.json
drink/water1/soft/90091/2/89.csv
car/type/1/001/1.json
game/mmo/1/2/3.json

I want to extract the last 3 numbers from the data, e.g., from user/values2/class/year/subject/18/9/2000291.csv I want x =
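A minimal Presto/Athena sketch of one way to do this, assuming the last three path segments before the file extension are always numeric and the column is named path_col (a hypothetical name):

SELECT
  regexp_extract(path_col, '/(\d+)/(\d+)/(\d+)\.\w+$', 1) AS num1,  -- third-from-last segment
  regexp_extract(path_col, '/(\d+)/(\d+)/(\d+)\.\w+$', 2) AS num2,  -- second-from-last segment
  regexp_extract(path_col, '/(\d+)/(\d+)/(\d+)\.\w+$', 3) AS num3   -- last segment before the extension
FROM my_table;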

reduce the amount of data scanned by Athena when using aggregate functions

Submitted by 浪尽此生 on 2020-01-02 07:34:10
Question: The query below scans 100 MB of data:

select * from table where column1 = 'val' and partition_id = '20190309';

However, the query below scans 15 GB of data (there are over 90 partitions):

select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);

How can I optimize the second query to scan the same amount of data as the first?

Answer 1: There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the
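A hedged sketch of one common workaround, under the assumption that Athena cannot use a subquery result for partition pruning: resolve the latest partition with a separate query first, then inline the returned value as a literal so the main query prunes partitions the same way as the first query.

-- step 1: find the newest partition value (query over the partition column only)
select max(partition_id) from table;

-- step 2: plug the returned value in as a literal so only that partition is read
select * from table where column1 = 'val' and partition_id = '20190309';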

Casting not working correctly in Amazon Athena (Presto)?

Submitted by 不羁的心 on 2019-12-31 05:15:06
Question: I have a doctor license registry dataset which includes the total_submitted_charge_amount for each doctor as well as the number of entitlements with Medicare & Medicaid. I used the query from the answer suggested below:

with datamart AS (
  SELECT npi, provider_last_name, provider_first_name, provider_mid_initial,
         provider_address_1, provider_address_2, provider_city, provider_zipcode,
         provider_state_code, provider_country_code, provider_type, number_of_services,
         CASE WHEN REPLACE(num
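A hedged guess at the usual pitfall with this kind of cast in Presto: CAST(... AS DECIMAL) without an explicit precision and scale can silently lose the fractional part, so spelling both out (and using NULLIF for empty strings) is safer. The column names follow the question; the precision and scale are assumptions:

SELECT CAST(NULLIF(REPLACE(total_submitted_charge_amount, ',', ''), '') AS DECIMAL(18,2))
         AS total_submitted_charge_amount
FROM datamart;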

AWS Athena: Delete partitions between date range

Submitted by 不打扰是莪最后的温柔 on 2019-12-25 01:37:11
Question: I have an Athena table with partitions based on a date like this: 20190218. I want to delete all the partitions that were created last year. I tried the queries below, but they didn't work:

ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');
ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');

Answer 1: According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION takes a partition spec,
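A hedged sketch of one workaround, assuming the partition spec must match exact values rather than ranges: list the existing partitions, then drop each value in the range with its own statement (the values below are placeholders).

SHOW PARTITIONS tblname;

ALTER TABLE tblname DROP PARTITION (partition1 = '20181011');
ALTER TABLE tblname DROP PARTITION (partition1 = '20181012');
-- ...one statement per partition value in the range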

How to handle a generic internal error in Athena

Submitted by 微笑、不失礼 on 2019-12-25 01:12:01
Question: I have the following query used on one of my datasets in Athena.

CREATE TABLE clean_table
WITH (format='Parquet', external_location='s3://test123data')
AS SELECT npi, provider_last_name, provider_first_name,
   CASE WHEN REPLACE(num_entitlement_medicare_medicaid,',','') = '' THEN null
        ELSE CAST(REPLACE(num_entitlement_medicare_medicaid,',','') AS DECIMAL)
   END AS medicare_medicaid_entitlement,
   CASE WHEN REPLACE(total_submitted_charge_amount,',','') = '' THEN null
        ELSE CAST(REPLACE(num
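A hedged note on one frequent cause of generic internal errors with CTAS (an assumption about this particular case, not a confirmed diagnosis): external_location must point at an empty S3 prefix, and giving each table its own slash-terminated path avoids collisions. A minimal sketch with placeholder names:

CREATE TABLE clean_table
WITH (format = 'Parquet',
      -- hypothetical empty prefix; CTAS fails if the location already holds data
      external_location = 's3://test123data/clean_table/')
AS SELECT npi, provider_last_name, provider_first_name
FROM source_table;  -- source_table is a placeholder name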

Simplest tool in AWS for very simple (transform in) ETL?

Submitted by 痞子三分冷 on 2019-12-25 01:11:36
Question: We have numerous files in S3 totaling tens of gigabytes. We need to get them into CSV format; currently the files have delimiters that are not commas. Normally I would do this on a server using sed, but I don't want to have to transfer the files to a server. I want to read directly from S3, translate to CSV line by line, and write the results back to new S3 files. Glue appears to be able to do this, but I sense the learning curve and setup for such a simple task is overkill. Is there not some
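A hedged sketch of one serverless option (not necessarily what the asker ended up using): define an external table over the delimited files and let an Athena CTAS rewrite them as CSV. Bucket paths, column names, and the source delimiter are assumptions:

CREATE EXTERNAL TABLE source_data (col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'   -- the existing, non-comma delimiter
LOCATION 's3://my-bucket/input/';

CREATE TABLE output_csv
WITH (format = 'TEXTFILE', field_delimiter = ',',
      external_location = 's3://my-bucket/output/')
AS SELECT * FROM source_data;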

How do I convert a string which is actually a date with timezone to a timestamp in Presto?

Submitted by 烂漫一生 on 2019-12-24 20:00:56
Question: Example: 2017-12-24 23:59:59.000 PST

This does not work:

select date_parse('2017-12-24 23:59:59.000 PST','%Y-%m-%d %T.%f %x')

Sure, I can truncate the TZ, which solves it:

select date_parse(substr('2017-12-24 23:59:59.000 PST',1,23),'%Y-%m-%d %T.%f')

Is there a way to do this without truncating the TZ?

Answer 1: date_parse doesn't seem to support time zones; use parse_datetime instead:

presto> select parse_datetime('2017-12-24 23:59:59.000 PST', 'YYYY-MM-dd HH:mm:ss.SSS z');
 _col0
-----------------
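A hedged follow-up sketch, in case the value needs to be anchored to a specific zone rather than the session's time zone (the target zone here is an assumption):

select parse_datetime('2017-12-24 23:59:59.000 PST', 'YYYY-MM-dd HH:mm:ss.SSS z')
       AT TIME ZONE 'UTC';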

AWS Glue Crawlers and large tables stored in S3

Submitted by 北战南征 on 2019-12-24 10:18:40
Question: I have a general question about AWS Glue and its crawlers. I have data streaming into S3 buckets, and I use AWS Athena to access it as external tables in Redshift. The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour. The problem is that the crawlers take longer and longer, and someday they will not finish in less than an hour. Is there some setting to speed up this process, or some proper alternative to the crawlers in
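A hedged sketch of one common alternative to re-crawling every hour, assuming the schema itself is stable and only new hourly partitions arrive (table name, partition keys, and the S3 path are placeholders):

ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (year = '2019', month = '12', day = '24', hour = '10')
  LOCATION 's3://my-bucket/data/2019/12/24/10/';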

Lambda Python request athena error OutputLocation

Submitted by 允我心安 on 2019-12-24 07:55:00
Question: I'm working with AWS Lambda and I would like to make a simple query in Athena and store my data in S3. My code:

import boto3

def lambda_handler(event, context):
    query_1 = "SELECT * FROM test_athena_laurent.stage limit 5;"
    database = "test_athena_laurent"
    s3_output = "s3://athena-laurent-result/lambda/"
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=query_1,
        ClientRequestToken='string',
        QueryExecutionContext={
            'Database': database
        },

S3 Query Exception (Fetch)

Submitted by 本小妞迷上赌 on 2019-12-24 07:18:16
Question: I have uploaded data from Redshift to S3 in Parquet format and created the data catalog in Glue. I have been able to query the table from Athena, but when I create the external schema on Redshift and try to query the table I get the error below:

ERROR: S3 Query Exception (Fetch)
DETAIL:
-----------------------------------------------
error: S3 Query Exception (Fetch)
code: 15001
context: Task failed due to an internal error. File 'https://s3-eu-west-1.amazonaws.com/bucket/folder
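For context, a hedged sketch of the Redshift Spectrum setup being described; the schema name, Glue database, role ARN, and table name are placeholders, and in this kind of setup the REGION has to match the bucket's region while the IAM role needs read access to both the Glue catalog and the S3 data:

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
REGION 'eu-west-1';

SELECT * FROM spectrum_schema.my_parquet_table LIMIT 10;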