amazon-athena

Create Table in Athena From Nested JSON

Submitted by ≯℡__Kan透↙ on 2021-01-29 22:23:02

Question: I have nested JSON of the form [{ "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{ "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }, { "allocated": "", "parent_type": "", "parentid": "", "product_type": "", "purchased_license_id": "", "service_type": "" }] }, { "emails": [{ "label": "", "primary": "", "relationdef_id": "", "type": "", "value": "" }], "licenses": [{
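
A hedged sketch of one way to model this shape in Athena, assuming each top-level object is stored as a single line of JSON in S3 (the OpenX JSON SerDe reads one object per line). The table name and S3 location are placeholders, and struct field names that collide with reserved words (e.g. primary) may need backquoting or renaming:

```sql
-- Sketch: map each JSON key to a column; the nested arrays of objects
-- become array<struct<...>> columns.
CREATE EXTERNAL TABLE nested_json_example (
  emails ARRAY<STRUCT<
    label: STRING,
    `primary`: STRING,
    relationdef_id: STRING,
    type: STRING,
    value: STRING>>,
  licenses ARRAY<STRUCT<
    allocated: STRING,
    parent_type: STRING,
    parentid: STRING,
    product_type: STRING,
    purchased_license_id: STRING,
    service_type: STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/nested-json/';

-- Flatten the arrays at query time with CROSS JOIN UNNEST:
SELECT e.value, l.product_type
FROM nested_json_example
CROSS JOIN UNNEST(emails) AS t(e)
CROSS JOIN UNNEST(licenses) AS u(l);
```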

AWS Athena row cast fails when key is a reserved keyword despite double quotes

Submitted by 社会主义新天地 on 2021-01-29 18:55:57

Question: I'm working with data in AWS Athena and trying to match the structure of some input data, which involves a nested structure where "from" is a key. This consistently throws errors. I've narrowed the issue down to the fact that Athena queries fail when you use a reserved keyword as a field name in a row. The following examples demonstrate this behavior. This simple case, SELECT CAST(ROW(1) AS ROW("from" INTEGER)), fails with the following error: GENERIC_INTERNAL_ERROR: Unable to create
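
The failure above reflects a limitation of Athena's Presto-based engine: a CAST to a ROW type can reject reserved words as field names even when they are double-quoted. A hedged sketch of two workarounds (the field and key names here are illustrative, not from the truncated answer):

```sql
-- Works: use a non-reserved field name inside the ROW...
SELECT CAST(ROW(1) AS ROW(from_addr INTEGER)) AS r;

-- ...or, if the literal key "from" must survive (e.g. when the row is
-- later serialized to JSON), use a MAP: map keys are plain strings,
-- not identifiers, so reserved words are fine.
SELECT MAP(ARRAY['from'], ARRAY[1]) AS m;
```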

Special characters in AWS Athena show up as question marks

Submitted by 我的梦境 on 2021-01-29 11:08:42

Question: I've added a table in AWS Athena from a CSV file that uses the special characters "æøå". These show up as � in the output. The CSV file is encoded as Unicode; I've also tried changing the encoding to UTF-8, with no luck. I uploaded the CSV to S3 and then added the table to Athena using the following DDL: CREATE EXTERNAL TABLE `regions_dk`( `postnummer` string COMMENT 'from deserializer', `kommuner` string COMMENT 'from deserializer', `regioner` string COMMENT 'from deserializer') ROW
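
A likely cause: a file saved as "Unicode" (typically UTF-16, e.g. from Excel) is read by Athena's SerDes as if it were UTF-8, producing �. A hedged sketch of the table-side fix: re-encode the file to UTF-8 before upload, or declare the file's real encoding via the serialization.encoding serde property, which LazySimpleSerDe honors (OpenCSVSerde has no such knob). Names and the encoding value below are placeholders for this situation:

```sql
CREATE EXTERNAL TABLE regions_dk (
  postnummer STRING,
  kommuner STRING,
  regioner STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  -- tell the SerDe the file's actual encoding instead of assuming UTF-8
  'serialization.encoding' = 'windows-1252'
)
LOCATION 's3://my-bucket/regions/';
```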

Athena displays special characters as ?

Submitted by 喜欢而已 on 2021-01-29 10:30:51

Question: I have an external table with the below DDL: CREATE EXTERNAL TABLE `table_1`( `name` string COMMENT 'from deserializer', `desc1` string COMMENT 'from deserializer', `desc2` string COMMENT 'from deserializer', ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'quoteChar'='\"', 'separatorChar'='|', 'skip.header.line.count'='1') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

AWS Athena: Named boto3 queries not creating corresponding tables

Submitted by 拈花ヽ惹草 on 2021-01-29 10:24:06

Question: I have the following draft boto3 script: #!/usr/bin/env python3 import boto3 client = boto3.client('athena') BUCKETS='buckets.txt' DATABASE='some_db' QUERY_STR="""CREATE EXTERNAL TABLE IF NOT EXISTS some_db.{}( BucketOwner STRING, Bucket STRING, RequestDateTime STRING, RemoteIP STRING, Requester STRING, RequestID STRING, Operation STRING, Key STRING, RequestURI_operation STRING, RequestURI_key STRING, RequestURI_httpProtoversion STRING, HTTPstatus STRING, ErrorCode STRING, BytesSent BIGINT,
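
A hedged sketch of the usual cause of this symptom: boto3's create_named_query only saves a query in Athena's saved-queries list and never runs it, so no table appears; each DDL statement must be executed with start_query_execution. The table-name normalization, database, and S3 output location below are assumptions, not from the truncated question:

```python
# Minimal sketch, assuming the DDL template from the question (abbreviated here).
QUERY_TEMPLATE = """CREATE EXTERNAL TABLE IF NOT EXISTS some_db.{table} (
  BucketOwner STRING,
  Bucket STRING,
  RequestDateTime STRING
  -- ... remaining columns from the original script ...
)"""

def build_ddl(bucket: str) -> str:
    # Athena table names cannot contain '.' or '-', so normalize the bucket name.
    table = bucket.replace(".", "_").replace("-", "_")
    return QUERY_TEMPLATE.format(table=table)

def run_ddl(bucket: str, output_s3: str = "s3://my-athena-results/") -> str:
    """Actually execute the DDL (requires AWS credentials); returns the execution id."""
    import boto3  # imported here so the pure helper above is testable offline
    client = boto3.client("athena")
    resp = client.start_query_execution(   # runs the query, unlike create_named_query
        QueryString=build_ddl(bucket),
        QueryExecutionContext={"Database": "some_db"},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```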

Query by “$path” field

Submitted by 我是研究僧i on 2021-01-29 05:52:55

Question: I want to query by a file or group of files under a partition inside a table. I found that when I use the "$path" field, Athena scans the entire partition rather than only the files I want. Is there a way to make this kind of query more efficient and scan only the given files? Something like partition pruning, but for files... Here is a sample query: SELECT * FROM my_table WHERE day = '2019-01-01' AND "$path" = 's3://my-bucket/my-table/day=2019-01-01/my_file' Answer 1: No. It's not possible to get Athena
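
For context on why: "$path" is a hidden metadata column materialized while each object is being read, so a predicate on it can only filter rows after the scan; only real partition columns prune at planning time. A hedged sketch of the behavior (table and paths taken from the question above):

```sql
-- The day = ... predicate prunes: only day=2019-01-01 objects are read.
-- The "$path" predicate is applied AFTER the partition is scanned, so it
-- narrows the result set but not the bytes scanned/billed. If per-file
-- queries are frequent, the file grouping should become a real partition key.
SELECT *, "$path" AS source_file
FROM my_table
WHERE day = '2019-01-01'
  AND "$path" = 's3://my-bucket/my-table/day=2019-01-01/my_file';
```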

Rename Column in Athena

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-28 14:27:05

Question: The Athena table "organization" reads data from Parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find data for the new column in the Parquet files. Please let me know if there are ways to resolve this. Answer 1: You have to change the schema and point to the new column "fee", but it depends on your situation. If you have two data sets, and in one dataset it is called "cost" and in another dataset it is
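
The symptom arises because Athena resolves Parquet columns by name by default, so a renamed column no longer matches the old files. One documented approach (from Athena's schema-updates guidance) is to switch the table to index-based (positional) column access; this works only if the renamed column keeps the same ordinal position in every file. The table name and location below are placeholders, and the leading column is illustrative:

```sql
-- Recreate the table with the new column name in the same ordinal slot,
-- and tell Athena to resolve Parquet columns by position instead of name:
CREATE EXTERNAL TABLE organization (
  org_id STRING,   -- illustrative leading column
  fee DOUBLE       -- stored as "cost" in the existing files
)
STORED AS PARQUET
LOCATION 's3://my-bucket/organization/'
TBLPROPERTIES ('parquet.column.index.access' = 'true');
```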

Spark Small ORC Stripes

Submitted by 跟風遠走 on 2021-01-28 11:58:32

Question: We use Spark to flatten clickstream data and then write it to S3 in ORC+zlib format. I have tried changing many settings in Spark, but the resultant stripe sizes of the ORC files being created are still very small (<2MB). Things I have tried so far to increase the stripe size: earlier each file was 20MB in size; using coalesce I am now creating files of 250-300MB, but there are still 200 stripes per file, i.e. each stripe is <2MB. I tried using hivecontext instead of
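
A hedged sketch of one knob to try, assuming a PySpark job: the ORC writer's target stripe size ("orc.stripe.size", in bytes) can be passed as a write option, which Spark forwards to the ORC library. Note that tiny stripes are often a symptom of the writer flushing early under memory pressure, so raising the target alone may not be enough; reducing the number of concurrent writers (partitions) helps for the same reason. `df` and the output path are placeholders:

```python
# Target stripe size for the ORC writer, in bytes (64 MiB).
STRIPE_SIZE = 64 * 1024 * 1024

def write_orc(df, path, files=8):
    """Write `df` as ORC+zlib with fewer, larger files and a larger stripe target.
    Sketch only: assumes `df` is an existing pyspark.sql.DataFrame."""
    (df.coalesce(files)                        # fewer concurrent writers, larger files
       .write
       .option("compression", "zlib")
       .option("orc.stripe.size", str(STRIPE_SIZE))
       .orc(path, mode="overwrite"))
```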