amazon-athena

AWS Glue Crawler Cannot Extract CSV Headers

霸气de小男生 submitted on 2019-12-24 03:26:05
Question: At my wits' end here... I have 15 CSV files that I am generating from a beeline query like: beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv I chose dsv because some string fields include commas and they are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does). Anyway, I upload these 15 CSV files to an S3 bucket and run my crawler. Everything works great for 14 of them. Glue is …
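One common workaround when the crawler mis-detects headers is to bypass the crawler for the problematic file and declare the table by hand in Athena, forcing the pipe delimiter and telling it to skip the header row. A minimal sketch of that idea; the table name, columns, and S3 location below are placeholders, not taken from the question:

CREATE EXTERNAL TABLE my_dsv_export (              -- hypothetical name and schema
  id    string,
  notes string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'                         -- beeline dsv output is pipe-delimited
LOCATION 's3://my-bucket/path/to/export/'          -- placeholder bucket/prefix
TBLPROPERTIES ('skip.header.line.count' = '1');    -- treat the first row as a header, not data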

How to avoid AWS Athena CTAS query creating small files?

人盡茶涼 submitted on 2019-12-24 01:16:16
Question: I'm unable to figure out what is wrong with my CTAS query: it breaks the data into smaller files while storing inside a partition, even though I haven't mentioned any bucketing columns. Is there a way to avoid these small files and store one single file per partition, since files smaller than 128 MB would cause additional overhead? CREATE TABLE sampledb.yellow_trip_data_parquet WITH( format = 'PARQUET', parquet_compression = 'GZIP', external_location='s3://mybucket/Athena/tables/parquet/' …
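A frequently suggested way to control the number of output files per partition is to add bucketing to the CTAS and set bucket_count to 1, so Athena writes a single file for each partition. A sketch of that pattern, reusing the table name from the question; the column names are assumptions, and the partition column must be the last column in the SELECT:

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH (
  format              = 'PARQUET',
  parquet_compression = 'GZIP',
  external_location   = 's3://mybucket/Athena/tables/parquet/',
  partitioned_by      = ARRAY['pickup_date'],   -- hypothetical partition column
  bucketed_by         = ARRAY['vendor_id'],     -- hypothetical column used only to drive bucketing
  bucket_count        = 1                       -- one bucket => one file per partition
) AS
SELECT vendor_id, trip_distance, pickup_date    -- placeholder columns; partition column last
FROM sampledb.yellow_trip_data_raw;             -- hypothetical source table

Note that a single large file trades away read parallelism, so this is only worth it when the per-partition data really is small.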

Converting a struct to JSON when querying Athena

╄→гoц情女王★ submitted on 2019-12-23 09:25:38
Question: I have an Athena table which I did not create or manage, but can query. One of the fields is a struct type. For the sake of the example, let's suppose it looks like this: my_field struct<a:string, b:string, c:struct<d:string,e:string>> Now, I know how to query specific fields within this struct. But in one of my queries I need to extract the complete struct, so I just use: select my_field from my_table and the result looks like a string: {a=aaa, b=bbb, c={d=ddd, e=eee}} I want to get the …
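If the goal is real JSON text rather than Presto's default row rendering, casting the struct to JSON usually gets much closer. A minimal sketch using the names from the question; note that depending on the Athena engine version the result may come back as a JSON array of values rather than an object keyed by field names:

SELECT json_format(CAST(my_field AS JSON)) AS my_field_json
FROM my_table;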

Convert folder structure to partitions on S3 using Spark

北城以北 submitted on 2019-12-23 04:35:19
Question: I have a lot of data on S3 which is laid out in plain folders instead of partitions. The structure looks like this: ## s3://bucket/countryname/year/weeknumber/a.csv s3://Countries/Canada/2019/20/part-1.csv s3://Countries/Canada/2019/20/part-2.csv s3://Countries/Canada/2019/20/part-3.csv s3://Countries/Canada/2019/21/part-1.csv s3://Countries/Canada/2019/21/part-2.csv Is there any way to convert that data into partitions? Something like this: s3://Countries/Country=Canada/Year=2019/Week=20/part-1.csv s3:/ …
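If rewriting the objects with Spark is not strictly required, an alternative is to leave the files where they are and register each existing prefix as a partition from the Athena side, since ADD PARTITION can point a partition at any S3 location. A sketch of that approach, assuming a table countries_csv has already been defined over this data with partition keys country, year, and week:

ALTER TABLE countries_csv ADD IF NOT EXISTS
  PARTITION (country = 'Canada', year = '2019', week = '20')
    LOCATION 's3://Countries/Canada/2019/20/'
  PARTITION (country = 'Canada', year = '2019', week = '21')
    LOCATION 's3://Countries/Canada/2019/21/';

Every country/week prefix needs its own PARTITION clause, so in practice the statement would be generated by a small script rather than written by hand.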

Reusing subqueries in AWS Athena generates a large amount of data scanned

笑着哭i submitted on 2019-12-23 03:44:10
Question: On AWS Athena, I am trying to reuse computed data using a WITH clause, e.g. WITH temp_table AS (...) SELECT ... FROM temp_table t0, temp_table t1, temp_table t2 WHERE ... Even if the query is fast, the "Data scanned" figure goes through the roof, as if temp_table is computed each time it is referenced in the FROM clause. I don't see the issue if I create a temp table separately and use it multiple times in the query. Is there a way to really reuse a subquery multiple times without any penalty?
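Since Athena (Presto) may re-evaluate a common table expression each time it is referenced, one workaround is to materialize the intermediate result once with CTAS and then query that physical table repeatedly. A hedged sketch of the pattern; the staging table name and location are placeholders:

-- Step 1: materialize the expensive subquery once
CREATE TABLE sampledb.temp_table
WITH (
  format            = 'PARQUET',
  external_location = 's3://my-bucket/athena/temp_table/'   -- placeholder location
) AS
SELECT ... ;   -- body of the original WITH clause goes here

-- Step 2: reference the materialized table as many times as needed
SELECT ...
FROM sampledb.temp_table t0, sampledb.temp_table t1, sampledb.temp_table t2
WHERE ... ;

The staging table (and its S3 data) has to be dropped afterwards, which is the price of the reduced scan.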

Connect to Athena using JDBC in a Maven project

痞子三分冷 submitted on 2019-12-22 13:11:34
Question: I'm trying to connect to Amazon Athena using JDBC in a Maven project, but an exception is being raised. I think the class is not being found. Athena's guide says: "Set the JDBC property, aws_credentials_provider_class, equal to the class name, and include it in your classpath." (1) Since I'm using Eclipse, I thought the class would already be on the classpath, but apparently not. I tested the code in a simple Java project (not Maven) and it worked. AmazonCredentialsProvider …

AWS Athena (Presto) OFFSET support

匆匆过客 submitted on 2019-12-22 10:45:55
Question: I would like to know if there is support for OFFSET in AWS Athena. For MySQL the following query runs, but in Athena it gives me an error. Any example would be helpful. select * from employee where empSal > 3000 LIMIT 300 OFFSET 20 Answer 1: Athena is basically managed Presto. Since Presto 311 you can use the OFFSET m LIMIT n syntax or the ANSI SQL equivalent: OFFSET m ROWS FETCH NEXT n ROWS ONLY. For older versions (and this includes AWS Athena as of this writing), you can use a row_number() window function …
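For engines without OFFSET, the row_number() workaround mentioned in the answer looks roughly like this (a sketch against the employee table from the question; empId is an assumed sort key, since pagination needs a deterministic ordering):

SELECT *
FROM (
  SELECT e.*,
         row_number() OVER (ORDER BY empId) AS rn   -- empId is assumed; use any stable sort key
  FROM employee e
  WHERE empSal > 3000
) t
WHERE rn BETWEEN 21 AND 320;   -- rows 21..320 = skip 20, take the next 300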

SerDe properties list for AWS Athena (JSON)

我的未来我决定 submitted on 2019-12-22 09:52:23
Question: I'm testing the Athena product of AWS; so far it is working very well. But I want to know the list of SerDe properties. I've searched far and wide and couldn't find it. I'm using this one for example, "ignore.malformed.json" = "true", but I'm pretty sure there are a ton of other options to tune the queries. I couldn't find info, for example, on what the "path" property does, so having the full list would be amazing. I have looked at the Apache Hive docs but couldn't find this, and neither in the AWS docs …
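For reference, SerDe options are supplied through WITH SERDEPROPERTIES in the table DDL; a minimal sketch with a couple of commonly cited options for the OpenX JSON SerDe (the table name, columns, mapping, and S3 location are placeholders, and this is nowhere near an exhaustive list):

CREATE EXTERNAL TABLE my_json_table (   -- hypothetical table and schema
  id      string,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'ignore.malformed.json' = 'true',     -- skip rows that are not valid JSON
  'mapping.payload'       = 'Payload'   -- example: map the JSON key "Payload" to the column payload
)
LOCATION 's3://my-bucket/json-data/';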

SHOW PARTITIONS with order by in Amazon Athena

。_饼干妹妹 submitted on 2019-12-21 13:07:51
Question: I have this query: SHOW PARTITIONS tablename; Result is: dt=2018-01-12 dt=2018-01-20 dt=2018-05-21 dt=2018-04-07 dt=2018-01-03 This gives the list of partitions for the table. The partition field for this table is dt, which is a date column. I want to see the partitions ordered. The documentation doesn't explain how to do it: https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html I tried to add ORDER BY: SHOW PARTITIONS tablename order by dt; But it gives: AmazonAthena; Status Code: 400 …
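SHOW PARTITIONS does not accept ORDER BY, but the partition values are also exposed through the hidden "$partitions" metadata table, which can be queried with ordinary SQL, for example:

SELECT dt
FROM "tablename$partitions"
ORDER BY dt DESC;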