amazon-athena

Query results difference between EMR-Presto and Athena

感情迁移 posted on 2019-12-11 13:05:04
Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I ran the same query on both but got different results: EMR returns 0 rows, while Athena returns 43 rows. The query is fairly simple, with a left join, a group by, and a count distinct. It looks like this:

    select t1.customer_id as id, t2.purchase_date as purchase_date, count(distinct t1.purchase_id) as item_count from table1 t1 left join table2 as t2 on t2.purchase_id=t1.purchase_id

Error on query parsing alb logs by datetime in aws athena

守給你的承諾、 posted on 2019-12-11 12:46:13
Question: I followed the steps in the link to create the ALB table in Athena. I am trying to query the logs by datetime, but I get the error below.

Query:

    SELECT client_ip, sum(received_bytes)
    FROM default.alb_logs
    WHERE parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSS''Z')
      BETWEEN parse_datetime('2018-08-27-12:00:00','yyyy-MM-dd-HH:mm:ss')
      AND parse_datetime('2018-08-28-12:00:00','yyyy-MM-dd-HH:mm:ss')
    GROUP BY client_ip

Error: Your query has the following error(s):

AWS Athena JSON Multidimensional Array Structure

情到浓时终转凉″ posted on 2019-12-11 12:26:04
Question: The JSON file has a structure like this:

    "otherstuff" : "stuff",
    "ArrayofArrays" : {
      "Array-1" : { "type" : "sometype", "is_enabled" : false, "is_active" : false, "version" : "version 1.1" },
      "Array-2" : { "type" : "sometype", "is_enabled" : false, "is_active" : false, "version" : "version 1.2" }
      ...
    }

The query runs with the following table definition:

    CREATE EXTERNAL TABLE IF NOT EXISTS test2.table14 ( `otherstuff` string, `ArrayofArrays` array<array<struct<version:string>>> ) ROW FORMAT SERDE 'org

Running query containing pseudo column from aws athena cli

﹥>﹥吖頭↗ posted on 2019-12-11 06:19:57
Question: With reference to the post "How to get input file name as column in AWS Athena external tables", I tried running the query with the aws athena CLI command as below:

    aws athena start-query-execution --query-string "SELECT regexp_extract(\"$path\", '[^/]+$') AS filename FROM table" --query-execution-context '{"Database": "testdatabase"}' --result-configuration '{ "OutputLocation": "s3://<somevalidbucket>"}'

The query always executes with an empty value for $path, e.g. "SELECT regexp
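A likely cause, stated here as an assumption because the excerpt is cut off before any answer: inside a double-quoted shell string, `$path` is expanded by the shell as a (usually unset) variable before the CLI ever sees it. The sketch below builds the same command as an argument list in Python, so no shell expansion takes place. The helper name `build_athena_command` is hypothetical; the flags and values are the ones from the question.

```python
import json
import shlex


def build_athena_command(database, output_location):
    """Return the CLI arguments as a list, keeping "$path" literal."""
    # The dollar sign must reach Athena untouched; a Python string does not expand it.
    query = "SELECT regexp_extract(\"$path\", '[^/]+$') AS filename FROM table"
    return [
        "aws", "athena", "start-query-execution",
        "--query-string", query,
        "--query-execution-context", json.dumps({"Database": database}),
        "--result-configuration", json.dumps({"OutputLocation": output_location}),
    ]


argv = build_athena_command("testdatabase", "s3://<somevalidbucket>")

# shlex.join renders a shell-safe form: the query is wrapped in single quotes,
# so a shell would leave $path alone if this line were pasted into a terminal.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv)` (without `shell=True`) would hand the query to the CLI unchanged. When typing the command directly in a shell, escaping the dollar sign as `\$path` inside the double quotes should have the same effect.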

What's the data format of Athena's .csv.metadata files?

纵然是瞬间 posted on 2019-12-11 04:25:01
Question: What's the data format of the .csv.metadata files written by Amazon Athena? Alongside the output file of every query there is a metadata file. It looks like it describes the schema of the result. I assume this is what Athena uses to create the ResultSet.ResultSetMetadata part of the response to GetQueryResults requests, and that it is somehow created by Hive or Presto.

    2019-04-23 14:51:29 27 e7629796-9b91-476a-bfb7-2fe6c9595bce.csv
    2019-04-23 14:51:29 56 e7629796-9b91-476a-bfb7-2fe6c9595bce

Apache superset: cannot read metadata from Athena

£可爱£侵袭症+ posted on 2019-12-11 04:04:24
Question: I am trying to access Athena from Superset. The connection is successful, and I can see all the schemas and tables in the SQL editor (with "Expose this DB in SQL Lab" enabled). In the SQL editor, loading the metadata returns the following error:

    ERROR OCCURRED WHILE FETCHING TABLE METADATA

On Athena, it runs the following query:

    SELECT table_schema, table_name, column_name, data_type, is_nullable, column_default, ordinal_position, comment FROM information_schema.columns

And this query returns the following

Athena: Minimize data scanned by query including JOIN operation

こ雲淡風輕ζ posted on 2019-12-11 00:01:34
Question: Suppose there is an external table in Athena that points to a large amount of data stored in Parquet format on S3. It has many columns and is partitioned on a field called 'timeid'. There is also a second, small external table that maps timeid to date. When the smaller table is also partitioned on timeid, and we join the two on their partition key (timeid) and put the date into the where clause, only those specific records are scanned from the large table which contain timeids corresponding to that

Split one row into multiple rows based on comma-separated string column

混江龙づ霸主 posted on 2019-12-10 23:55:14
Question: I have a table like the one below, with columns A (int) and B (string):

    A  B
    1  a,b,c
    2  d,e
    3  f,g,h

I want to create an output like this:

    A  B
    1  a
    1  b
    1  c
    2  d
    2  e
    3  f
    3  g
    3  h

If it helps, I am doing this in Amazon Athena (which is based on Presto). I know that Presto provides a function to split a string into an array. From the Presto docs:

    split(string, delimiter) → array
    Splits string on delimiter and returns an array.

I am not sure how to proceed from here, though.

Answer 1: Use unnest on the array returned by split.
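To make the answer concrete, here is a small Python sketch of what split followed by unnest does to the rows in the question. The `QUERY` string shows how the same idea might be written in Presto SQL; the table name `my_table` and the aliases `t`, `x`, and `item` are hypothetical, and the query is illustrative rather than tested against Athena.

```python
# Presto-style query for the same transformation (hypothetical table and alias names).
QUERY = """
SELECT t.A, x.item AS B
FROM my_table AS t
CROSS JOIN UNNEST(split(t.B, ',')) AS x (item)
"""


def split_rows(rows, delimiter=","):
    """Emit one (A, item) pair per element of B, mirroring split + unnest."""
    return [(a, item) for a, b in rows for item in b.split(delimiter)]


rows = [(1, "a,b,c"), (2, "d,e"), (3, "f,g,h")]
expanded = split_rows(rows)

for a, item in expanded:
    print(a, item)
```

Each input row is repeated once per element of its split array, which is the effect of cross joining a table with the unnested array.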

AWS Athena: does `msck repair table` incur costs?

懵懂的女人 posted on 2019-12-10 21:48:02
Question: I have ORC data in S3 that looks like this:

    s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
    s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
    s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/

Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path partition convention (above) for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions. I have
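The key=value path convention in the question can be read mechanically. The sketch below parses one of the listed prefixes into partition keys and, as one option, renders an `ALTER TABLE ... ADD PARTITION` statement that registers a single new partition explicitly. The function names and the table name `events` are hypothetical, and quoting every value as a string assumes the partition columns accept string literals.

```python
def parse_partitions(s3_uri, table_root):
    """Extract Hive-style key=value partition segments below the table root."""
    if not s3_uri.startswith(table_root):
        raise ValueError("URI is not under the table root")
    relative = s3_uri[len(table_root):].strip("/")
    partitions = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip any segment that is not in key=value form
            partitions[key] = value
    return partitions


def add_partition_sql(table, s3_uri, table_root):
    """Render a DDL statement for one partition (values quoted as strings)."""
    spec = ", ".join(
        f"{key}='{value}'" for key, value in parse_partitions(s3_uri, table_root).items()
    )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) LOCATION '{s3_uri}'"


uri = "s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/"
parts = parse_partitions(uri, "s3://bucket/orc/")
statement = add_partition_sql("events", uri, "s3://bucket/orc/")
print(parts)
print(statement)
```

Since the hourly job already knows which prefix it just wrote, it could issue a statement like this for that one partition instead of asking Athena to rediscover all of them.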

AWS Athena json_extract query from string field returns empty values

天涯浪子 posted on 2019-12-10 19:06:49
Question: I have a table in Athena with this structure:

    CREATE EXTERNAL TABLE `json_test`( `col0` string , `col1` string , `col2` string , `col3` string , `col4` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', 'separatorChar'='\;')

A JSON string like this is stored in "col4":

    {'email': 'test_email@test_email.com', 'name': 'Andrew', 'surname': 'Test Test'}

I'm trying to make a json_extract query:

    SELECT json_extract(col4 , '$.email') as
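One thing worth checking, offered as an assumption since the excerpt ends before any answer: the value stored in col4 uses single quotes, which is not valid JSON, and JSON functions typically return nothing for input they cannot parse. The Python sketch below shows that a strict JSON parser rejects the stored string and accepts it once it is rewritten with double quotes. The helper names are hypothetical.

```python
import ast
import json

# The value from the question, exactly as stored in col4.
raw = "{'email': 'test_email@test_email.com', 'name': 'Andrew', 'surname': 'Test Test'}"


def is_valid_json(text):
    """Return True only if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def normalize(text):
    """Read the single-quoted literal and re-serialize it as standard JSON."""
    return json.dumps(ast.literal_eval(text))


fixed = normalize(raw)
email = json.loads(fixed)["email"]

print(is_valid_json(raw))    # single-quoted form
print(is_valid_json(fixed))  # double-quoted form
print(email)
```

If this is indeed the cause, writing the data with double quotes at the source, or replacing the single quotes before calling json_extract, would be the corresponding fix on the Athena side.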