amazon-athena

OFFSET on AWS Athena

Submitted by 半城伤御伤魂 on 2019-12-01 06:01:12
Question: I would like to run a query on AWS Athena with both a LIMIT and an OFFSET clause. I take it the former is supported while the latter is not. Is there any way of emulating this functionality using other methods?

Answer 1: Using OFFSET for pagination is very inefficient, especially for an analytic database like Presto, which often has to perform a full table or partition scan. Additionally, the results will not necessarily be consistent between queries, so you can have duplicate or missing results when navigating between pages. In an OLTP database like MySQL or PostgreSQL, it's better to use a range query
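
A minimal sketch of the range-query (keyset pagination) approach the answer alludes to, assuming a hypothetical table named events with a unique, sortable id column:

-- Hypothetical table and column names. Rather than OFFSET, remember the
-- last key returned on the previous page and filter past it.

-- Page 1: order by a unique, sortable key and take one page.
SELECT id, payload
FROM events
ORDER BY id
LIMIT 100;

-- Page 2: resume after the last id seen on page 1 (here, 100).
SELECT id, payload
FROM events
WHERE id > 100
ORDER BY id
LIMIT 100;

Because each page is pinned to a key range rather than a row count, the page boundaries stay stable even if rows are added between queries.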

Create Athena table from nested json source

Submitted by 夙愿已清 on 2019-11-30 20:46:12
Question: How shall I create an Athena table from the nested JSON file? This is my sample JSON file. I only need selected key-value pairs like roofCondition and garageStalls. { "reportId":"7bc7fa76-bf53-4c21-85d6-118f6a8f4244", "reportOrderedTS":"1529996028730", "createdTS":"1530304910154", "report":"{'summaryElements': [{'value': 'GOOD', 'key': 'roofCondition'}, {'value': '98', 'key': 'storiesConfidence'}{'value': '0', 'key': 'garageStalls'}], 'elements': [{'source': 'xyz', 'imageId': '0xxx_png',
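
One possible direction, sketched below: keep the nested report field as a raw string and pull out the wanted keys at query time with Presto's JSON functions. This assumes the report value is valid JSON (the sample shows single quotes, which would need normalizing to double quotes first); the table and S3 location names are hypothetical.

CREATE EXTERNAL TABLE reports (
  reportId string,
  reportOrderedTS string,
  createdTS string,
  report string   -- keep the nested document as a raw string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/reports/';

-- Unnest summaryElements and keep only the wanted key/value pairs.
SELECT
  reportId,
  json_extract_scalar(e, '$.key')   AS element_key,
  json_extract_scalar(e, '$.value') AS element_value
FROM reports
CROSS JOIN UNNEST(
  CAST(json_extract(report, '$.summaryElements') AS ARRAY(JSON))
) AS t(e)
WHERE json_extract_scalar(e, '$.key') IN ('roofCondition', 'garageStalls');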

Store multiple elements in json files in AWS Athena

Submitted by 风流意气都作罢 on 2019-11-30 13:19:43
Question: I have some JSON files stored in an S3 bucket, where each file has multiple elements of the same structure. For example, [{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0",
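
Athena's JSON SerDes read one JSON object per line, so a file that is a single top-level array generally has to be rewritten as JSON Lines (one object per line, no enclosing brackets) before a table like the sketch below will work; the table and bucket names here are hypothetical:

CREATE EXTERNAL TABLE dynamodb_events (
  eventId string,
  eventName string,
  eventVersion string,
  eventSource string,
  awsRegion string,
  image struct<Message:string, Id:int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/events/';

-- Nested fields are then addressed with dot notation.
SELECT eventId, eventName, image.Message, image.Id
FROM dynamodb_events;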

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

Submitted by 老子叫甜甜 on 2019-11-30 06:23:32
Question: As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case, different subsets of columns from the table schema)? At the moment, when I run the crawler over this data and then make a query in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'. My use case is:
- Partitions represent days
- Files represent events
- Each event is a JSON blob in a single S3 file
- An event contains a subset of columns (dependent on the type of event)
- The 'schema' of the entire table is the full set of columns for all the
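
One workaround, sketched under assumed names: define the table by hand with the full superset of columns (the JSON SerDe simply returns NULL for keys missing from a given event) and register day partitions explicitly, so every partition inherits the table-level schema instead of a crawler-inferred one:

-- Hypothetical table, column, and bucket names.
CREATE EXTERNAL TABLE events (
  common_col string,
  type_a_col string,   -- present only in type-A events
  type_b_col int       -- present only in type-B events
)
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/events/';

ALTER TABLE events ADD IF NOT EXISTS
  PARTITION (day = '2019-11-30') LOCATION 's3://mybucket/events/2019-11-30/';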

Aws Athena - Create external table skipping first row

Submitted by 自闭症网瘾萝莉.ら on 2019-11-29 09:22:53
Question: I'm trying to create an external table on CSV files with AWS Athena with the code below, but the line TBLPROPERTIES ("skip.header.line.count"="1") doesn't work: it doesn't skip the first line (header) of the CSV file.

CREATE EXTERNAL TABLE mytable (
  colA string,
  colB int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES ("skip.header.line.count"="1")

Any advice?

Answer (Filippo Loddo): Just tried the "skip.header.line.count"="1" and
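
A commonly cited workaround, offered here as an assumption since the answer above is cut off: skip.header.line.count has historically been honored by LazySimpleSerDe even when OpenCSVSerde ignored it, so if the CSV has no quoted fields containing commas, a plain delimited table can stand in:

CREATE EXTERNAL TABLE mytable (
  colA string,
  colB int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES ('skip.header.line.count' = '1');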

How to make MSCK REPAIR TABLE execute automatically in AWS Athena

Submitted by 梦想与她 on 2019-11-28 23:16:14
Question: I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile. After uploading the data to S3, I want to investigate it using Athena. Moreover, I would like to visualize it in QuickSight by connecting to Athena as a data source. The problem is that, after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena unless I manually run the query MSCK REPAIR TABLE. Is there a way to make Athena update the data automatically, so that I can create a
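
One common pattern, sketched with hypothetical table and bucket names (not necessarily what the accepted answer proposes): have the hourly job register the partition it just wrote, which touches only one prefix and avoids a full MSCK REPAIR TABLE scan:

ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (year = '2019', month = '11', date = '28')
  LOCATION 's3://mybucket/DATA/YEAR=2019/MONTH=11/DATE=28/';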

Partition Athena query by S3 created date

Submitted by 五迷三道 on 2019-11-27 06:33:13
Question: I have an S3 bucket with ~70 million JSONs (~15 TB) and an Athena table to query by timestamp and some other keys defined in the JSON. It is guaranteed that the timestamp in the JSON is more or less equal to the S3 created date of the JSON (or at least equal enough for the purpose of my query). Can I somehow improve query performance (and cost) by adding the created date as something like a "partition" - which I understand seems only to be possible for prefixes/folders? edit: I currently
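
Athena partitions do map to S3 prefixes, so using the created date this way means rewriting the objects into a date-keyed layout. A CTAS sketch of that one-time repartitioning, with hypothetical table, column, and location names (it assumes the timestamp is stored as epoch seconds in an event_ts column):

CREATE TABLE events_partitioned
WITH (
  format = 'PARQUET',
  external_location = 's3://mybucket/events-partitioned/',
  partitioned_by = ARRAY['createddate']   -- partition column must come last in the SELECT
) AS
SELECT
  *,
  date_format(from_unixtime(event_ts), '%Y-%m-%d') AS createddate
FROM events_raw;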