amazon-athena

AWS Glue Crawler Cannot Extract CSV Headers

霸气de小男生 submitted on 2019-12-24 03:26:05
Question: At my wits' end here... I have 15 CSV files that I am generating from a beeline query like: beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv I chose dsv because some string fields include commas and they are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does). Anyway, I upload these 15 CSV files to an S3 bucket and run my crawler. Everything works great for 14 of them. Glue is …
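One common workaround when the crawler mis-detects headers is to bypass the crawler for the problematic file and declare the table by hand in Athena, forcing the pipe delimiter and telling it to skip the header row. A minimal sketch of that idea; the table name, columns, and S3 location below are placeholders, not taken from the question:

CREATE EXTERNAL TABLE my_dsv_export (              -- hypothetical name and schema
  id    string,
  notes string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'                         -- beeline dsv output is pipe-delimited
LOCATION 's3://my-bucket/path/to/export/'          -- placeholder bucket/prefix
TBLPROPERTIES ('skip.header.line.count' = '1');    -- treat the first row as a header, not data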

How to avoid AWS Athena CTAS query creating small files?

人盡茶涼 submitted on 2019-12-24 01:16:16
Question: I'm unable to figure out what is wrong with my CTAS query: it breaks the data into smaller files while storing inside a partition, even though I haven't mentioned any bucketing columns. Is there a way to avoid these small files and store one single file per partition, since files smaller than 128 MB would cause additional overhead? CREATE TABLE sampledb.yellow_trip_data_parquet WITH( format = 'PARQUET', parquet_compression = 'GZIP', external_location='s3://mybucket/Athena/tables/parquet/' …
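A frequently suggested way to control the number of output files per partition is to add bucketing to the CTAS and set bucket_count to 1, so Athena writes a single file for each partition. A sketch of that pattern, reusing the table name from the question; the column names are assumptions, and the partition column must be the last column in the SELECT:

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH (
  format              = 'PARQUET',
  parquet_compression = 'GZIP',
  external_location   = 's3://mybucket/Athena/tables/parquet/',
  partitioned_by      = ARRAY['pickup_date'],   -- hypothetical partition column
  bucketed_by         = ARRAY['vendor_id'],     -- hypothetical column used only to drive bucketing
  bucket_count        = 1                       -- one bucket => one file per partition
) AS
SELECT vendor_id, trip_distance, pickup_date    -- placeholder columns; partition column last
FROM sampledb.yellow_trip_data_raw;             -- hypothetical source table

Note that a single large file trades away read parallelism, so this is only worth it when the per-partition data really is small.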

Converting a struct to JSON when querying Athena

╄→гoц情女王★ submitted on 2019-12-23 09:25:38
Question: I have an Athena table which I did not create or manage, but can query. One of the fields is a struct type. For the sake of the example, let's suppose it looks like this: my_field struct<a:string, b:string, c:struct<d:string,e:string>> Now, I know how to query specific fields within this struct. But in one of my queries I need to extract the complete struct, so I just use: select my_field from my_table and the result looks like a string: {a=aaa, b=bbb, c={d=ddd, e=eee}} I want to get the …
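If the goal is real JSON text rather than Presto's default row rendering, casting the struct to JSON usually gets much closer. A minimal sketch using the names from the question; note that depending on the Athena engine version the result may come back as a JSON array of values rather than an object keyed by field names:

SELECT json_format(CAST(my_field AS JSON)) AS my_field_json
FROM my_table;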

Convert folder structure to partitions on S3 using Spark

北城以北 submitted on 2019-12-23 04:35:19
Question: I have a lot of data on S3 which is laid out in plain folders instead of partitions. The structure looks like this: ## s3://bucket/countryname/year/weeknumber/a.csv s3://Countries/Canada/2019/20/part-1.csv s3://Countries/Canada/2019/20/part-2.csv s3://Countries/Canada/2019/20/part-3.csv s3://Countries/Canada/2019/21/part-1.csv s3://Countries/Canada/2019/21/part-2.csv Is there any way to convert that data into partitions? Something like this: s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv s3:/ …
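If rewriting the objects with Spark is not strictly required, an alternative is to leave the files where they are and register each existing prefix as a partition from the Athena side, since ADD PARTITION can point a partition at any S3 location. A sketch of that approach, assuming a table countries_csv has already been defined over this data with partition keys country, year, and week:

ALTER TABLE countries_csv ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year = '2019', week = '20')
    LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year = '2019', week = '21')
    LOCATION 's3://Countries/Canada/2019/21/';

Every country/week prefix needs its own PARTITION clause, so in practice the statement would be generated by a small script rather than written by hand.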

Reusing subqueries in AWS Athena generates a large amount of data scanned

笑着哭i submitted on 2019-12-23 03:44:10
Question: On AWS Athena, I am trying to reuse computed data using a WITH clause, e.g. WITH temp_table AS (...) SELECT ... FROM temp_table t0, temp_table t1, temp_table t2 WHERE ... Even if the query is fast, the "Data scanned" figure goes through the roof, as if temp_table is computed each time it is referenced in the FROM clause. I don't see the issue if I create a temp table separately and use it multiple times in the query. Is there a way to really reuse a subquery multiple times without any penalty?
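Since Athena (Presto) may re-evaluate a common table expression each time it is referenced, one workaround is to materialize the intermediate result once with CTAS and then query that physical table repeatedly. A hedged sketch of the pattern; the staging table name and location are placeholders:

-- Step 1: materialize the expensive subquery once
CREATE TABLE sampledb.temp_table
WITH (
  format            = 'PARQUET',
  external_location = 's3://my-bucket/athena/temp_table/'   -- placeholder location
) AS
SELECT ... ;   -- body of the original WITH clause goes here

-- Step 2: reference the materialized table as many times as needed
SELECT ...
FROM sampledb.temp_table t0, sampledb.temp_table t1, sampledb.temp_table t2
WHERE ... ;

The staging table (and its S3 data) has to be dropped afterwards, which is the price of the reduced scan.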

Connect to Athena using JDBC in a Maven project

痞子三分冷 submitted on 2019-12-22 13:11:34
Question: I'm trying to connect to Amazon Athena using JDBC in a Maven project, but an exception is being raised. I think the class is not being found. Athena's guide says: "Set the JDBC property, aws_credentials_provider_class, equal to the class name, and include it in your classpath." (1) Since I'm using Eclipse, I thought the class would already be on the classpath, but apparently not. I tested the code in a simple Java project (not Maven) and it worked. AmazonCredentialsProvider …

AWS Athena (Presto) OFFSET support

匆匆过客 submitted on 2019-12-22 10:45:55
Question: I would like to know if there is support for OFFSET in AWS Athena. For MySQL the following query runs, but in Athena it gives me an error. Any example would be helpful. select * from employee where empSal > 3000 LIMIT 300 OFFSET 20 Answer 1: Athena is basically managed Presto. Since Presto 311 you can use the OFFSET m LIMIT n syntax or the ANSI SQL equivalent: OFFSET m ROWS FETCH NEXT n ROWS ONLY. For older versions (and this includes AWS Athena as of this writing), you can use a row_number() window function …
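For engines without OFFSET, the row_number() workaround mentioned in the answer looks roughly like this (a sketch against the employee table from the question; empId is an assumed sort key, since pagination needs a deterministic ordering):

SELECT *
FROM (
  SELECT e.*,
         row_number() OVER (ORDER BY empId) AS rn   -- empId is assumed; use any stable sort key
  FROM employee e
  WHERE empSal > 3000
) t
WHERE rn BETWEEN 21 AND 320;   -- rows 21..320 = skip 20, take the next 300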

SerDe properties list for AWS Athena (JSON)

我的未来我决定 submitted on 2019-12-22 09:52:23
Question: I'm testing the Athena product of AWS; so far it is working very well. But I want to know the list of SerDe properties. I've searched far and wide and couldn't find it. I'm using this one for example, "ignore.malformed.json" = "true", but I'm pretty sure there are a ton of other options to tune the queries. I couldn't find info, for example, on what the "path" property does, so having the full list would be amazing. I have looked at the Apache Hive docs but couldn't find this, and neither in the AWS docs …
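For reference, SerDe options are supplied through WITH SERDEPROPERTIES in the table DDL; a minimal sketch with a couple of commonly cited options for the OpenX JSON SerDe (the table name, columns, mapping, and S3 location are placeholders, and this is nowhere near an exhaustive list):

CREATE EXTERNAL TABLE my_json_table (   -- hypothetical table and schema
  id      string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',     -- skip rows that are not valid JSON
  'mapping.payload'       = 'Payload'   -- example: map the JSON key "Payload" to the column payload
)
LOCATION 's3://my-bucket/json-data/';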

SHOW PARTITIONS with order by in Amazon Athena

。_饼干妹妹 submitted on 2019-12-21 13:07:51
Question: I have this query: SHOW PARTITIONS tablename; Result is: dt=2018-01-12 dt=2018-01-20 dt=2018-05-21 dt=2018-04-07 dt=2018-01-03 This gives the list of partitions for the table. The partition field for this table is dt, which is a date column. I want to see the partitions ordered. The documentation doesn't explain how to do it: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html I tried to add ORDER BY: SHOW PARTITIONS tablename order by dt; But it gives: AmazonAthena; Status Code: 400 …
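SHOW PARTITIONS does not accept ORDER BY, but the partition values are also exposed through the hidden "$partitions" metadata table, which can be queried with ordinary SQL, for example:

SELECT dt
FROM "tablename$partitions"
ORDER BY dt DESC;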