amazon-athena | 易学教程

Query results difference between EMR-Presto and Athena

Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I ran the same query on both but got different results: EMR returns 0 rows, while Athena returns 43 rows. The query is simple, with a left join, a group by, and a count distinct. It looks like this: select t1.customer_id as id, t2.purchase_date as purchase_date, count(distinct t1.purchase_id) as item_count from table1 t1 left join table2 as t2 on t2.purchase_id=t1.purchase_id

Error on query parsing alb logs by datetime in aws athena

Question: I followed the steps in the link to create the ALB table in Athena. I am trying to query the logs by datetime, but I get the error below. Query: SELECT client_ip, sum(received_bytes) FROM default.alb_logs WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSS''Z') BETWEEN parse_datetime('2018-08-27-12:00:00','yyyy-MM-dd-HH:mm:ss') AND parse_datetime('2018-08-28-12:00:00','yyyy-MM-dd-HH:mm:ss') GROUP BY client_ip Error: Your query has the following error(s):

AWS Athena JSON Multidimensional Array Structure

Question: The JSON file has a structure like this: "otherstuff" : "stuff", "ArrayofArrays" : { "Array-1" : { "type" : "sometype", "is_enabled" : false, "is_active" : false, "version" : "version 1.1" }, "Array-2" : { "type" : "sometype", "is_enabled" : false, "is_active" : false, "version" : "version 1.2" } ... } The query runs with the following table definition: CREATE EXTERNAL TABLE IF NOT EXISTS test2.table14 ( `otherstuff` string, `ArrayofArrays` array<array<struct<version:string>>> ) ROW FORMAT SERDE 'org
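
Despite its name, "ArrayofArrays" in the quoted sample is a JSON object whose values are themselves objects, not an array of arrays. The Python sketch below (not Athena code) parses a copy of the sample to make that shape visible. The outer braces are added and the "..." is dropped so that the text parses; both are assumptions about the full file. The shape suggests, again as an assumption rather than a confirmed fix, that a column type such as map<string, struct<...>> may describe the data better than array<array<struct<...>>>.

```python
import json

# Copy of the sample from the question, wrapped in braces so it parses.
# The elided "..." entries are left out.
sample = """
{
  "otherstuff": "stuff",
  "ArrayofArrays": {
    "Array-1": {"type": "sometype", "is_enabled": false,
                "is_active": false, "version": "version 1.1"},
    "Array-2": {"type": "sometype", "is_enabled": false,
                "is_active": false, "version": "version 1.2"}
  }
}
"""

doc = json.loads(sample)
nested = doc["ArrayofArrays"]

# A JSON object becomes a Python dict; a JSON array would become a list.
kind = type(nested).__name__

# Entries are addressed by key ("Array-1", "Array-2"), not by position.
versions = {name: entry["version"] for name, entry in nested.items()}

print(kind)
print(versions)
```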

Running query containing pseudo column from aws athena cli

Question: With reference to the post "How to get input file name as column in AWS Athena external tables", I tried running the query with the aws athena CLI as follows: aws athena start-query-execution --query-string "SELECT regexp_extract(\"$path\", '[^/]+$') AS filename FROM table" --query-execution-context '{"Database": "testdatabase"}' --result-configuration '{ "OutputLocation": "s3://<somevalidbucket>"}' The query is always executed with an empty value for $path, e.g., "SELECT regexp
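
In a POSIX shell, $path inside a double-quoted string is expanded as a shell variable before the aws CLI ever sees it, and an unset variable expands to nothing, which would explain the empty value. Escaping it as \$path is one way around this. Another is to avoid the shell altogether, as in the Python sketch below, which builds an argument list suitable for subprocess.run. The flags and values are copied from the command in the question; the helper name build_command is hypothetical, and the command is only constructed here, not executed.

```python
import json
import shlex

# The query exactly as Athena should receive it, with a literal "$path".
QUERY = "SELECT regexp_extract(\"$path\", '[^/]+$') AS filename FROM table"


def build_command(query, database, output_location):
    """Build an argv list for `aws athena start-query-execution`.

    Passing a list to subprocess.run (without shell=True) bypasses the
    shell, so "$path" is not treated as a shell variable.
    """
    return [
        "aws", "athena", "start-query-execution",
        "--query-string", query,
        "--query-execution-context", json.dumps({"Database": database}),
        "--result-configuration", json.dumps({"OutputLocation": output_location}),
    ]


cmd = build_command(QUERY, "testdatabase", "s3://<somevalidbucket>")

# shlex.join renders a shell-safe version for pasting into a terminal.
print(shlex.join(cmd))
```

To actually run it, pass cmd to subprocess.run(cmd, check=True) in an environment where the AWS CLI is installed and configured.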

What's the data format of Athena's .csv.metadata files?

Question: What's the data format of the .csv.metadata files written by Amazon Athena? Alongside the output file of every query there is a metadata file. It looks like it describes the schema of the result. I assume this is what Athena uses to create the ResultSet.ResultSetMetadata part of the response to GetQueryResults requests, and that it is somehow created by Hive or Presto.
2019-04-23 14:51:29 27 e7629796-9b91-476a-bfb7-2fe6c9595bce.csv
2019-04-23 14:51:29 56 e7629796-9b91-476a-bfb7-2fe6c9595bce

Apache Superset: cannot read metadata from Athena

Question: I am trying to access Athena from Superset. The connection is successful, and I can see all the schemas and tables in the SQL editor (with "Expose this DB in SQL Lab" enabled). In the SQL editor, loading the table metadata returns the following error: ERROR OCCURRED WHILE FETCHING TABLE METADATA. On Athena, it runs the following query: SELECT table_schema, table_name, column_name, data_type, is_nullable, column_default, ordinal_position, comment FROM information_schema.columns And this query returns the following

Athena: Minimize data scanned by query including JOIN operation

Question: Suppose there is an external table in Athena that points to a large amount of data stored in Parquet format on S3. It has many columns and is partitioned on a field called 'timeid'. There is also a second, small external table that maps timeid to date. When the smaller table is also partitioned on timeid, and we join the two on their partition key (timeid) and put the date into the where clause, only those specific records are scanned from the large table which contain timeids corresponding to that

Split one row into multiple rows based on comma-separated string column

Question: I have a table like the one below, with columns A (int) and B (string):

A  B
1  a,b,c
2  d,e
3  f,g,h

I want to produce output like this:

A  B
1  a
1  b
1  c
2  d
2  e
3  f
3  g
3  h

If it helps, I am doing this in Amazon Athena (which is based on Presto). I know that Presto provides a function to split a string into an array. From the Presto docs: split(string, delimiter) → array. Splits string on delimiter and returns an array. I am not sure how to proceed from here, though.

Answer 1: Use unnest on the array returned by split.
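
As a rough illustration of that answer, the Python sketch below holds one possible form of the Presto query as a string and emulates its effect in plain Python so the expected rows can be checked locally. The table name my_table and the aliases t, u, and item are hypothetical, and the emulation is not how Athena executes the query.

```python
# One possible Presto/Athena formulation (my_table, t, u, item are
# hypothetical names): each array element becomes its own row.
QUERY = """
SELECT t.A, u.item AS B
FROM my_table AS t
CROSS JOIN UNNEST(split(t.B, ',')) AS u (item)
"""


def split_rows(rows, delimiter=","):
    """Emulate split + unnest: emit one (A, element) pair per element of B."""
    return [(a, part) for a, b in rows for part in b.split(delimiter)]


# Input rows copied from the question.
rows = [(1, "a,b,c"), (2, "d,e"), (3, "f,g,h")]
result = split_rows(rows)

for a, b in result:
    print(a, b)
```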

AWS Athena: does `msck repair table` incur costs?

Question: I have ORC data in S3 that looks like this:
s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/
Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path partition convention (above) for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions. I have
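
The key=value path segments above follow the Hive partition convention, which is what lets msck repair table discover partitions. The Python sketch below parses such a prefix into its partition keys and, as one possible alternative to a full repair, formats an ALTER TABLE ... ADD PARTITION statement for a single new prefix. The table name events and both helper names are hypothetical, and quoting every partition value as a string is an assumption about the column types.

```python
def parse_partitions(s3_uri):
    """Extract Hive-style key=value partition pairs from an S3 prefix."""
    segments = [seg for seg in s3_uri.split("/") if "=" in seg]
    return dict(seg.split("=", 1) for seg in segments)


def add_partition_ddl(table, s3_uri):
    """Format an ALTER TABLE ADD PARTITION statement for one prefix.

    Values are quoted as strings; adjust if the partition columns are
    declared with numeric types.
    """
    spec = parse_partitions(s3_uri)
    cols = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({cols}) LOCATION '{s3_uri}'"
    )


# Prefix copied from the question; "events" is a hypothetical table name.
uri = "s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/"
spec = parse_partitions(uri)
ddl = add_partition_ddl("events", uri)

print(spec)
print(ddl)
```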

AWS Athena json_extract query from string field returns empty values

Question: I have a table in Athena with this structure: CREATE EXTERNAL TABLE `json_test`( `col0` string , `col1` string , `col2` string , `col3` string , `col4` string , ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', 'separatorChar'='\;') A JSON string like this is stored in "col4": {'email': 'test_email@test_email.com', 'name': 'Andrew', 'surname': 'Test Test'} I'm trying to run a json_extract query: SELECT json_extract(col4 , '$.email') as
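
One plausible cause, offered as an assumption since the rest of the question is cut off, is that the stored string uses single quotes, which is not valid JSON. The Python sketch below shows a strict JSON parser rejecting the sample value and accepting it once the quotes are swapped. The helper try_parse is hypothetical, and the naive replacement would corrupt values that contain an apostrophe.

```python
import json

# Value copied from the question: note the single quotes.
raw = "{'email': 'test_email@test_email.com', 'name': 'Andrew', 'surname': 'Test Test'}"


def try_parse(text):
    """Return the parsed JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Strict JSON requires double-quoted keys and strings, so this fails.
strict = try_parse(raw)

# Naive normalization: swap single quotes for double quotes.
# This breaks on values that contain an apostrophe.
normalized = try_parse(raw.replace("'", '"'))
email = normalized["email"]

print(strict)
print(email)
```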