amazon-athena

SerDe properties list for AWS Athena (JSON)

点点圈 submitted on 2019-12-06 03:09:45
I'm testing the Athena product of AWS, and so far it is working very well. But I want to know the full list of SerDe properties. I've searched far and wide and couldn't find it. I'm using "ignore.malformed.json" = "true" for example, but I'm pretty sure there are a ton of other options to tune queries. I couldn't find info, for example, on what the "path" property does, so having the full list would be amazing. I have looked at the Apache Hive docs but couldn't find this, and neither on the AWS docs/forums. Thanks! It seems you are using the OpenX JSON SerDe http://docs.aws.amazon.com/athena/latest/ug
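For reference, the properties Athena documents for the OpenX JSON SerDe can be sketched in a DDL like the one below (table name, columns, and bucket path are hypothetical):

```sql
CREATE EXTERNAL TABLE events (
  id         string,
  created_at string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',     -- skip lines that are not valid JSON
  'case.insensitive'      = 'true',     -- match JSON keys case-insensitively
  'dots.in.keys'          = 'false',    -- do not rewrite dots in key names
  'mapping.created_at'    = 'timestamp' -- map the JSON key "timestamp" to column created_at
)
LOCATION 's3://my-bucket/events/';
```

The `mapping.<column>` properties are the usual way to work around JSON keys that collide with reserved words such as `timestamp`.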

Reduce the amount of data scanned by Athena when using aggregate functions

社会主义新天地 submitted on 2019-12-06 02:42:30
The query below scans 100 MB of data: select * from table where column1 = 'val' and partition_id = '20190309'; However, the query below scans 15 GB of data (there are over 90 partitions): select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table); How can I optimize the second query to scan the same amount of data as the first? There are two problems here: the efficiency of the scalar subquery select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering. The first problem is that queries over the
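One common workaround, assuming the latest partition only needs to be resolved once per run, is to look it up first and then query with a literal so Athena can prune partitions:

```sql
-- Step 1: find the newest partition without scanning the table's data
-- (SHOW PARTITIONS or the Glue GetPartitions API also work here)
SELECT max(partition_id) FROM table;

-- Step 2: substitute the returned value as a literal;
-- partition pruning now limits the scan to a single partition
SELECT * FROM table
WHERE column1 = 'val' AND partition_id = '20190309';
```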

How to Query parquet data from Amazon Athena?

你说的曾经没有我的故事 submitted on 2019-12-05 14:54:32
Athena creates a table using fields from data in S3. I have done this with JSON data. Could you help me with how to create a table using parquet data? I have tried the following: converted sample JSON data to parquet data, uploaded the parquet data to S3, and created a table using the columns of the JSON data. By doing this I am able to execute a query, but the result is empty. Is this approach right, or is there another approach to follow for parquet data? Sample JSON data: {"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016
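A minimal sketch of a Parquet-backed table, using column names taken from the sample JSON (the bucket path is hypothetical). Note that the files under LOCATION must actually be Parquet: a format mismatch between the SerDe and the files typically yields empty results rather than an error, which matches the symptom described above.

```sql
CREATE EXTERNAL TABLE parquet_table (
  `_id`       string,
  `_rev`      string,
  deviceid    string,
  `timestamp` string   -- reserved word, so it must be quoted
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet-data/';
```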

AWS Athena JDBC PreparedStatement

馋奶兔 submitted on 2019-12-05 12:21:20
I can't get the AWS Athena JDBC driver to work with PreparedStatement and bound variables. If I put the desired value of a column directly in the SQL string, it works. But if I use '?' placeholders and bind variables with the setters of PreparedStatement, it does not work. Of course, we know we should use the second approach (for caching, avoiding SQL injection, and so on). I use the JDBC driver AthenaJDBC42_2.0.2.jar. I get the following error when trying to use '?' placeholders in the SQL string. The error is thrown when I get the PreparedStatement from the JDBC Connection. It complains
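The 2.0.2 driver did not support server-side parameter binding. As an alternative worth checking, newer Athena releases added parameterized queries directly in SQL via PREPARE/EXECUTE; a sketch with a hypothetical table and values:

```sql
-- declare a statement with '?' placeholders
PREPARE get_orders FROM
SELECT * FROM orders WHERE country = ? AND order_date = ?;

-- run it, binding the placeholders in order
EXECUTE get_orders USING 'DE', '2019-12-01';
```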

How to skip headers when reading data from a CSV file in S3 and creating a table in AWS Athena

岁酱吖の submitted on 2019-12-05 04:48:46
I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table, once created, does not skip the header row of my CSV file. Query example: CREATE EXTERNAL TABLE IF NOT EXISTS table_name ( `event_type_id` string, `customer_id` string, `date` string, `email` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "|", "quoteChar" = "\"" ) LOCATION 's3://location/' TBLPROPERTIES ("skip.header.line.count"="1"); skip.header.line.count doesn't seem to work. I think AWS has some
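If skip.header.line.count is not honored by the engine version in use (early Athena releases ignored it), one fallback is to filter the header out at query time: with OpenCSVSerde the header is read as an ordinary data row whose fields repeat the column names.

```sql
SELECT *
FROM table_name
WHERE event_type_id <> 'event_type_id';  -- drops the header row
```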

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

故事扮演 submitted on 2019-12-05 01:47:36
Question: I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler: the date and timestamp data types get read as string data types. I followed this up by creating an ETL job in Glue using the data source created by the crawler as the input and a target table in Amazon S3. As part of the mapping transformation, I converted
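Independently of the Glue mapping, the string columns can be converted at query time in Athena. A sketch, assuming a hypothetical column event_time holding strings like '2019-12-05 01:47:36':

```sql
SELECT date_parse(event_time, '%Y-%m-%d %H:%i:%s') AS event_ts  -- string -> timestamp
FROM my_table;
-- for ISO-8601 strings, from_iso8601_timestamp(event_time) is an alternative
```

If date_parse returns NULL, the format string does not match the data, which is also the usual cause of NULLs coming out of a Glue string-to-timestamp mapping.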

Alternatives for Athena to query the data on S3

房东的猫 submitted on 2019-12-04 20:34:07
I have around 300 GB of data on S3. Let's say the data looks like: ## S3://Bucket/Country/Month/Day/1.csv S3://Countries/Germany/06/01/1.csv S3://Countries/Germany/06/01/2.csv S3://Countries/Germany/06/01/3.csv S3://Countries/Germany/06/02/1.csv S3://Countries/Germany/06/02/2.csv We are doing some complex aggregation on the data, and because some countries' data is large and some is small, AWS EMR doesn't make sense to use: once the small countries finish, the resources are wasted while the big countries keep running for a long time. Therefore, we decided to use
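If Athena remains a candidate, declaring country/month/day as partition keys lets it scan only the requested prefixes. Since these paths are not in Hive-style key=value form, each partition has to be registered explicitly (the data columns below are placeholders):

```sql
CREATE EXTERNAL TABLE countries_data (
  col1 string  -- actual CSV columns go here
)
PARTITIONED BY (country string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://Countries/';

-- the paths lack key=value segments, so MSCK REPAIR TABLE cannot discover them;
-- map each prefix explicitly instead
ALTER TABLE countries_data ADD
  PARTITION (country = 'Germany', month = '06', day = '01')
  LOCATION 's3://Countries/Germany/06/01/';
```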

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

。_饼干妹妹 submitted on 2019-12-04 19:28:22
According to the AWS Athena limitations you can submit up to 20 queries of the same type at a time, but this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In the AWS account, I am the only one using the Athena service. However, when I look at the state of the queries through the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in state Running. Here is what I would normally see in the Athena history tab: I understand that, after I

How to pivot rows into columns in AWS Athena?

心不动则不痛 submitted on 2019-12-04 09:27:23
I'm new to AWS Athena and trying to pivot some rows into columns, similar to the top answer in this StackOverflow post. However, when I tried: SELECT column1, column2, column3 FROM data PIVOT ( MIN(column3) FOR column2 IN ('VALUE1','VALUE2','VALUE3','VALUE4') ) I get the error: mismatched input '(' expecting {',', ')'} (service: amazonathena; status code: 400; error code: invalidrequestexception Does anyone know how to accomplish this in AWS Athena? Extending @kadrach's answer. Assuming a table like this: uid | key | value1 | value2 ----+-----+--------+-------- 1 | A |
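Athena's engine (Presto) has no PIVOT clause, which is why the parser rejects the statement. The usual substitute is conditional aggregation, sketched here against the uid/key table from the answer above:

```sql
SELECT uid,
       min(CASE WHEN key = 'A' THEN value1 END) AS a_value1,  -- one column per pivoted key
       min(CASE WHEN key = 'B' THEN value1 END) AS b_value1
FROM data
GROUP BY uid;
```

Each distinct key value becomes an explicit CASE expression, so the pivoted keys must be known when the query is written.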

SHOW PARTITIONS with order by in Amazon Athena

五迷三道 submitted on 2019-12-04 07:21:32
I have this query: SHOW PARTITIONS tablename; Result is: dt=2018-01-12 dt=2018-01-20 dt=2018-05-21 dt=2018-04-07 dt=2018-01-03 This gives the list of partitions per table. The partition field for this table is dt which is a date column. I want to see the partitions ordered. The documentation doesn't explain how to do it: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html I tried to add order by: SHOW PARTITIONS tablename order by dt; But it gives: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; I just faced the same issue and found a solution in information
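One working approach is the hidden "$partitions" metadata table, which, unlike the SHOW PARTITIONS statement, accepts ordinary SQL clauses; each partition key appears as a column:

```sql
SELECT * FROM "tablename$partitions"
ORDER BY dt DESC;
```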