amazon-athena

SerDe properties list for AWS Athena (JSON)

点点圈 submitted on 2019-12-06 03:09:45
I'm testing the Athena product of AWS, and so far it is working very well. But I want to know the full list of SerDe properties. I've searched far and wide and couldn't find it. I'm using "ignore.malformed.json" = "true" for example, but I'm pretty sure there are a ton of other options to tune queries. I couldn't find info, for example, on what the "path" property does, so having the full list would be amazing. I have looked at the Apache Hive docs but couldn't find this, and neither on the AWS docs/forums. Thanks! It seems you are using the OpenX JSON SerDe http://docs.aws.amazon.com/athena/latest/ug
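For reference, the properties Athena documents for the OpenX JSON SerDe can be sketched in a DDL like the one below (table name, columns, and bucket path are hypothetical):

```sql
CREATE EXTERNAL TABLE events (
  id         string,
  created_at string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',     -- skip lines that are not valid JSON
  'case.insensitive'      = 'true',     -- match JSON keys case-insensitively
  'dots.in.keys'          = 'false',    -- do not rewrite dots in key names
  'mapping.created_at'    = 'timestamp' -- map the JSON key "timestamp" to column created_at
)
LOCATION 's3://my-bucket/events/';
```

The `mapping.<column>` properties are the usual way to work around JSON keys that collide with reserved words such as `timestamp`.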

Reduce the amount of data scanned by Athena when using aggregate functions

社会主义新天地 submitted on 2019-12-06 02:42:30
The query below scans 100 MB of data: select * from table where column1 = 'val' and partition_id = '20190309'; However, the query below scans 15 GB of data (there are over 90 partitions): select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table); How can I optimize the second query to scan the same amount of data as the first? There are two problems here: the efficiency of the scalar subquery select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering. The first problem is that queries over the
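One common workaround, assuming the latest partition only needs to be resolved once per run, is to look it up first and then query with a literal so Athena can prune partitions:

```sql
-- Step 1: find the newest partition without scanning the table's data
-- (SHOW PARTITIONS or the Glue GetPartitions API also work here)
SELECT max(partition_id) FROM table;

-- Step 2: substitute the returned value as a literal;
-- partition pruning now limits the scan to a single partition
SELECT * FROM table
WHERE column1 = 'val' AND partition_id = '20190309';
```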

How to Query parquet data from Amazon Athena?

你说的曾经没有我的故事 submitted on 2019-12-05 14:54:32
Athena creates a table using fields from data in S3. I have done this with JSON data. Could you help me with how to create a table using parquet data? I have tried the following: converted sample JSON data to parquet data, uploaded the parquet data to S3, and created a table using the columns of the JSON data. By doing this I am able to execute a query, but the result is empty. Is this approach right, or is there another approach to follow for parquet data? Sample JSON data: {"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016
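A minimal sketch of a Parquet-backed table, using column names taken from the sample JSON (the bucket path is hypothetical). Note that the files under LOCATION must actually be Parquet: a format mismatch between the SerDe and the files typically yields empty results rather than an error, which matches the symptom described above.

```sql
CREATE EXTERNAL TABLE parquet_table (
  `_id`       string,
  `_rev`      string,
  deviceid    string,
  `timestamp` string   -- reserved word, so it must be quoted
)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet-data/';
```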

AWS Athena JDBC PreparedStatement

馋奶兔 submitted on 2019-12-05 12:21:20
I can't get the AWS Athena JDBC driver to work with PreparedStatement and bound variables. If I put the desired value of a column directly in the SQL string, it works. But if I use '?' placeholders and bind variables with the setters of PreparedStatement, it does not work. Of course, we know we should use the second approach (for caching, avoiding SQL injection, and so on). I use the JDBC driver AthenaJDBC42_2.0.2.jar. I get the following error when trying to use '?' placeholders in the SQL string. The error is thrown when I get the PreparedStatement from the JDBC Connection. It complains
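The 2.0.2 driver did not support server-side parameter binding. As an alternative worth checking, newer Athena releases added parameterized queries directly in SQL via PREPARE/EXECUTE; a sketch with a hypothetical table and values:

```sql
-- declare a statement with '?' placeholders
PREPARE get_orders FROM
SELECT * FROM orders WHERE country = ? AND order_date = ?;

-- run it, binding the placeholders in order
EXECUTE get_orders USING 'DE', '2019-12-01';
```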

How to skip headers when reading data from a CSV file in S3 and creating a table in AWS Athena

岁酱吖の submitted on 2019-12-05 04:48:46
I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table, once created, does not skip the header row of my CSV file. Query example: CREATE EXTERNAL TABLE IF NOT EXISTS table_name ( `event_type_id` string, `customer_id` string, `date` string, `email` string ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "|", "quoteChar" = "\"" ) LOCATION 's3://location/' TBLPROPERTIES ("skip.header.line.count"="1"); skip.header.line.count doesn't seem to work. I think AWS has some
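If skip.header.line.count is not honored by the engine version in use (early Athena releases ignored it), one fallback is to filter the header out at query time: with OpenCSVSerde the header is read as an ordinary data row whose fields repeat the column names.

```sql
SELECT *
FROM table_name
WHERE event_type_id <> 'event_type_id';  -- drops the header row
```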

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

故事扮演 submitted on 2019-12-05 01:47:36
Question: I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler: the date and timestamp data types get read as string data types. I followed this up by creating an ETL job in Glue using the data source created by the crawler as the input and a target table in Amazon S3. As part of the mapping transformation, I converted
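Independently of the Glue mapping, the string columns can be converted at query time in Athena. A sketch, assuming a hypothetical column event_time holding strings like '2019-12-05 01:47:36':

```sql
SELECT date_parse(event_time, '%Y-%m-%d %H:%i:%s') AS event_ts  -- string -> timestamp
FROM my_table;
-- for ISO-8601 strings, from_iso8601_timestamp(event_time) is an alternative
```

If date_parse returns NULL, the format string does not match the data, which is also the usual cause of NULLs coming out of a Glue string-to-timestamp mapping.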

Alternatives for Athena to query the data on S3

房东的猫 submitted on 2019-12-04 20:34:07
I have around 300 GB of data on S3. Let's say the data looks like: ## S3://Bucket/Country/Month/Day/1.csv S3://Countries/Germany/06/01/1.csv S3://Countries/Germany/06/01/2.csv S3://Countries/Germany/06/01/3.csv S3://Countries/Germany/06/02/1.csv S3://Countries/Germany/06/02/2.csv We are doing some complex aggregation on the data, and because some countries' data is large and some is small, AWS EMR doesn't make sense to use: once the small countries finish, the resources are wasted while the big countries keep running for a long time. Therefore, we decided to use
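If Athena remains a candidate, declaring country/month/day as partition keys lets it scan only the requested prefixes. Since these paths are not in Hive-style key=value form, each partition has to be registered explicitly (the data columns below are placeholders):

```sql
CREATE EXTERNAL TABLE countries_data (
  col1 string  -- actual CSV columns go here
)
PARTITIONED BY (country string, month string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://Countries/';

-- the paths lack key=value segments, so MSCK REPAIR TABLE cannot discover them;
-- map each prefix explicitly instead
ALTER TABLE countries_data ADD
  PARTITION (country = 'Germany', month = '06', day = '01')
  LOCATION 's3://Countries/Germany/06/01/';
```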

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

。_饼干妹妹 submitted on 2019-12-04 19:28:22
According to the AWS Athena limitations you can submit up to 20 queries of the same type at a time, but this is a soft limit that can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In the AWS account, I am the only one using the Athena service. However, when I look at the state of the queries through the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in state Running. Here is what I would normally see in the Athena history tab: I understand that, after I

How to pivot rows into columns in AWS Athena?

心不动则不痛 submitted on 2019-12-04 09:27:23
I'm new to AWS Athena and trying to pivot some rows into columns, similar to the top answer in this StackOverflow post. However, when I tried: SELECT column1, column2, column3 FROM data PIVOT ( MIN(column3) FOR column2 IN ('VALUE1','VALUE2','VALUE3','VALUE4') ) I get the error: mismatched input '(' expecting {',', ')'} (service: amazonathena; status code: 400; error code: invalidrequestexception Does anyone know how to accomplish this in AWS Athena? Extending @kadrach's answer. Assuming a table like this: uid | key | value1 | value2 ----+-----+--------+-------- 1 | A |
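Athena's engine (Presto) has no PIVOT clause, which is why the parser rejects the statement. The usual substitute is conditional aggregation, sketched here against the uid/key table from the answer above:

```sql
SELECT uid,
       min(CASE WHEN key = 'A' THEN value1 END) AS a_value1,  -- one column per pivoted key
       min(CASE WHEN key = 'B' THEN value1 END) AS b_value1
FROM data
GROUP BY uid;
```

Each distinct key value becomes an explicit CASE expression, so the pivoted keys must be known when the query is written.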

SHOW PARTITIONS with order by in Amazon Athena

五迷三道 submitted on 2019-12-04 07:21:32
I have this query: SHOW PARTITIONS tablename; Result is: dt=2018-01-12 dt=2018-01-20 dt=2018-05-21 dt=2018-04-07 dt=2018-01-03 This gives the list of partitions per table. The partition field for this table is dt which is a date column. I want to see the partitions ordered. The documentation doesn't explain how to do it: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html I tried to add order by: SHOW PARTITIONS tablename order by dt; But it gives: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; I just faced the same issue and found a solution in information
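One working approach is the hidden "$partitions" metadata table, which, unlike the SHOW PARTITIONS statement, accepts ordinary SQL clauses; each partition key appears as a column:

```sql
SELECT * FROM "tablename$partitions"
ORDER BY dt DESC;
```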