amazon-athena

Unable to locate Hive jars to connect to metastore while using a PySpark job to connect to Athena tables

坚强是说给别人听的谎言 submitted on 2021-02-11 13:19:44
Question: We are using a SageMaker instance to connect to EMR in AWS. We have some PySpark scripts that unload Athena tables and process them as part of a pipeline. We access the Athena tables through the Glue catalog, but when we try to run the job via spark-submit, the job fails. Code snippet:

    from pyspark import SparkContext, SparkConf
    from pyspark.context import SparkContext
    from pyspark.sql import Row, SQLContext, SparkSession
    import pyspark.sql.dataframe

    def process_data():
        conf = SparkConf()

Presto (Athena) loading of a CSV file with quote-escaped commas

别来无恙 submitted on 2021-02-10 13:37:16
Question: Consider the following row in a CSV file:

    1,0,True,"{""foo"":null,""bar"":null}",0,1

The comma between ""foo"":null and ""bar"":null (marked with ▲ in the original post) is part of a column value. That is, the full text "{""foo"":null,""bar"":null}" is the value of a single column. However, AWS Athena interprets that comma as a column delimiter and incorrectly splits the text into multiple columns. I know I could change the column delimiter to something else to avoid this problem. My question is: Is this a bug in AWS Athena / Presto? How
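To make the expected behavior concrete, here is a minimal sketch that parses the same row with Python's standard `csv` module, which treats a doubled quote inside a quoted field as a literal quote. It only illustrates how a quote-aware parser reads the row; it says nothing about how the Athena table in the question is configured.

```python
import csv
import io

# The row from the question. Inside a quoted field, "" stands for one literal quote.
line = '1,0,True,"{""foo"":null,""bar"":null}",0,1'

# csv.reader's default dialect is quote-aware, so the comma inside the
# quoted field is kept as data instead of being treated as a delimiter.
row = next(csv.reader(io.StringIO(line)))

print(len(row))  # number of columns the parser found
print(row[3])    # the JSON-like value as a single column
```

A parser that ignores quoting would instead split on every comma and report seven columns for this row.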

How does S3 Select pricing work? What do "data returned" and "data scanned" mean in S3 Select?

左心房为你撑大大i submitted on 2021-02-07 13:40:38
Question: I have 1M rows of CSV data. If I select 10 rows, will I be billed for 10 rows? What do "data returned" and "data scanned" mean in S3 Select? There is little documentation on these terms.

Answer 1: To keep things simple, let's forget for a moment that S3 reads in a columnar way. Suppose you have the following data:

| City      | Last Updated Date |
|-----------|-------------------|
| London    | 1st Jan           |
| London    | 2nd Jan           |
| New Delhi | 2nd Jan           |

A query for fetching the latest update date forces

Presto check if NULL and return default (NVL analog)

[亡魂溺海] submitted on 2021-02-06 14:48:18
Question: Is there an analog of NVL in Presto? I need to check whether a field is NULL and return a default value. I currently solve it like this:

    SELECT CASE WHEN my_field IS NULL THEN 0 ELSE my_field END FROM my_table

But I'm curious whether there is something that could simplify this code.

Answer 1: The ISO SQL function for that is COALESCE:

    coalesce(my_field, 0)

https://prestodb.io/docs/current/functions/conditional.html

P.S. COALESCE can be used with multiple arguments. It will return the first (from the left)
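The equivalence between the CASE expression and COALESCE can be checked with a small sketch. It uses Python's built-in `sqlite3` as a stand-in engine (an assumption, since no Presto cluster is available here), and the three sample values in `my_table` are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for Presto (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_field INTEGER)")
# Hypothetical sample data: one NULL between two ordinary values.
conn.executemany("INSERT INTO my_table VALUES (?)", [(5,), (None,), (7,)])

# The CASE form from the question.
case_rows = conn.execute(
    "SELECT CASE WHEN my_field IS NULL THEN 0 ELSE my_field END "
    "FROM my_table ORDER BY rowid"
).fetchall()

# The COALESCE form from the answer.
coalesce_rows = conn.execute(
    "SELECT COALESCE(my_field, 0) FROM my_table ORDER BY rowid"
).fetchall()

print(case_rows == coalesce_rows)
conn.close()
```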

AWS Athena (Presto) how to transpose map to columns

℡╲_俬逩灬. submitted on 2021-02-05 10:51:49
Question: AWS Athena query question: I have a nested map in my rows, and I would like to transpose its keys to columns. I could name the columns explicitly, like items['label_a'], but in this case the keys are actually dynamic. From these rows:

    {id=1, items={label_a=foo, label_b=foo}}
    {id=2, items={label_a=bar, label_c=bar}}
    {id=3, items={label_b=baz, label_c=baz}}

I would like to get a table like so:

| id | label_a | label_b | label_c |
------------------------------------
| 1 | foo | foo | | |
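The reshaping being asked for can be sketched outside SQL. The following minimal Python example works on the three rows from the question: a first pass discovers the dynamic keys, and a second pass emits one fixed-width record per row. The function name `transpose_items` is hypothetical, and this is a client-side illustration rather than an Athena query.

```python
# The three rows from the question, written as Python dicts.
rows = [
    {"id": 1, "items": {"label_a": "foo", "label_b": "foo"}},
    {"id": 2, "items": {"label_a": "bar", "label_c": "bar"}},
    {"id": 3, "items": {"label_b": "baz", "label_c": "baz"}},
]


def transpose_items(rows):
    """Turn each row's map keys into columns (hypothetical helper)."""
    # Pass 1: collect every key that appears in any row's map.
    columns = sorted({key for row in rows for key in row["items"]})
    # Pass 2: build one record per row; keys missing from a row become None.
    table = [
        {"id": row["id"], **{col: row["items"].get(col) for col in columns}}
        for row in rows
    ]
    return columns, table


columns, table = transpose_items(rows)
for record in table:
    print(record)
```

The two-pass shape matters because the set of output columns is only known after all keys have been seen.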

AWS Athena (Presto) how to transpose map to columns

青春壹個敷衍的年華 submitted on 2021-02-05 10:51:23
Question: AWS Athena query question: I have a nested map in my rows, and I would like to transpose its keys to columns. I could name the columns explicitly, like items['label_a'], but in this case the keys are actually dynamic. From these rows:

    {id=1, items={label_a=foo, label_b=foo}}
    {id=2, items={label_a=bar, label_c=bar}}
    {id=3, items={label_b=baz, label_c=baz}}

I would like to get a table like so:

| id | label_a | label_b | label_c |
------------------------------------
| 1 | foo | foo | | |

Querying optional nested JSON fields in Athena

混江龙づ霸主 submitted on 2021-02-05 09:25:10
Question: I have JSON data that looks something like:

    { "col1" : 123, "metadata" : { "opt1" : 456, "opt2" : 789 } }

where the various metadata fields (of which there are many) are optional and may or may not be present. My query is:

    select col1, metadata.opt1 from "db-name".tablename

If opt1 is not present in any rows, I would expect this to return all rows with a blank for the opt1 column, but if there wasn't a row with opt1 in metadata when the crawler ran (and might still not be present in data
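The behavior the asker expects (every row returned, with a blank where `opt1` is missing) can be sketched in plain Python. The first record below is the one from the question; the other two are hypothetical additions that omit `opt1` or the whole `metadata` object. The function name `select_col1_opt1` is also hypothetical, and this illustrates the desired result only, not how Athena resolves the column.

```python
import json

records = [
    '{"col1": 123, "metadata": {"opt1": 456, "opt2": 789}}',  # from the question
    '{"col1": 124, "metadata": {"opt2": 790}}',               # hypothetical: no opt1
    '{"col1": 125}',                                          # hypothetical: no metadata
]


def select_col1_opt1(lines):
    """Mimic `select col1, metadata.opt1`, yielding None for absent fields."""
    out = []
    for line in lines:
        doc = json.loads(line)
        # Treat a missing metadata object the same as an empty one.
        metadata = doc.get("metadata") or {}
        out.append((doc.get("col1"), metadata.get("opt1")))
    return out


result = select_col1_opt1(records)
print(result)
```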

athena presto - multiple columns from long to wide

柔情痞子 submitted on 2021-02-04 08:37:25
Question: I am new to Athena and I am trying to understand how to turn multiple columns from long to wide format. It seems like Presto is what is needed, but I have only been able to apply map_agg successfully to one variable. I think my final outcome below can be achieved with multimap_agg, but I cannot quite get it to work. Below I walk through my steps and data. If you have suggestions or questions, please let me know! First, the data starts like this: id | letter | number | value ------------------
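As a rough illustration of the single-variable step the asker already has working, the sketch below emulates a per-id `map_agg(letter, value)` in Python. The column names come from the header in the question; the sample rows and the helper name `map_agg_by_id` are hypothetical, because the original data is cut off in this excerpt. It does not attempt the multi-column outcome the asker is after.

```python
from collections import defaultdict

# Hypothetical long-format rows using the column names from the question.
long_rows = [
    {"id": 1, "letter": "a", "number": 1, "value": 10},
    {"id": 1, "letter": "b", "number": 2, "value": 20},
    {"id": 2, "letter": "a", "number": 3, "value": 30},
]


def map_agg_by_id(rows, key_col, value_col):
    """Group rows by id and collect key_col -> value_col into one map per id."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["id"]][row[key_col]] = row[value_col]
    return dict(grouped)


letter_to_value = map_agg_by_id(long_rows, "letter", "value")
print(letter_to_value)
```

Once each id has its map, a wide column is a lookup of one key in that map, which is the usual second half of this pattern.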

Create Table in Athena From Nested JSON

你说的曾经没有我的故事 submitted on 2021-01-29 22:52:47
Question: I have nested JSON of type [{ "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{ "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }, { "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }] }, { "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{

Create Table in Athena From Nested JSON

Deadly submitted on 2021-01-29 22:50:47
Question: I have nested JSON of type [{ "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{ "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }, { "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }] }, { "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{