amazon-athena

Athena can't resolve CSV files from AWS DMS

Submitted by Deadly on 2019-12-08 05:23:10
Question: I have DMS configured to continuously replicate data from MySQL RDS to S3. This creates two types of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:

testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz

Once DMS is running properly, I trigger an AWS Glue Crawler to build the Data Catalog for the S3 bucket that contains the MySQL replication files, so that Athena users will be able to build queries in our S3-based Data
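When the crawler misbehaves on this layout, a typical first step is to define the table by hand. Below is a minimal sketch of an Athena DDL over the DMS output prefix; the bucket name and address columns are hypothetical (only the Op column, which DMS prepends to CDC files but not to the full-load file, is documented DMS behavior), and Athena reads .gz files transparently:

    CREATE EXTERNAL TABLE testdb_addresses (
      op string,       -- 'I'/'U'/'D' in CDC files; absent from the full-load file
      id bigint,       -- hypothetical columns: replace with the real addresses schema
      street string,
      city string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://your-bucket/testdb/addresses/';   -- hypothetical bucket

Note that because the full-load and CDC files have different column counts (the Op column), one schema cannot fit both cleanly; a common workaround is to route LOAD*.csv.gz and the CDC files to separate prefixes and define one table per prefix.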

use SQL inside AWS Glue pySpark script

Submitted by 独自空忆成欢 on 2019-12-07 01:50:40
Question: I want to use AWS Glue to convert some CSV data to ORC. The ETL job I created generated the following PySpark script:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'
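For reference, a sketch of how the remainder of such a script can run plain Spark SQL before writing ORC. The database, table, and output path below are placeholders, not values from the original post:

    # Continuing after the generated boilerplate above (assumes job.init(...) completed).
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db", table_name="my_table")      # placeholder catalog names
    df = dyf.toDF()                                   # DynamicFrame -> Spark DataFrame
    df.createOrReplaceTempView("src")                 # register for SQL access
    result = spark.sql("SELECT col_a, count(*) AS n FROM src GROUP BY col_a")
    result.write.mode("overwrite").orc("s3://my-bucket/orc-output/")  # placeholder path
    job.commit()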

How to skip headers when reading data from a CSV file in S3 and creating a table in AWS Athena

Submitted by 冷暖自知 on 2019-12-07 00:52:13
Question: I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table, as created, was unable to skip the header line of my CSV file. Query example:

    CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
      `event_type_id` string,
      `customer_id` string,
      `date` string,
      `email` string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      "separatorChar" = "|",
      "quoteChar" = "\""
    )
    LOCATION 's3://location/'
    TBLPROPERTIES ("skip.header.line.count"
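For comparison, the completed property would normally look like the sketch below (assuming a single header line). One caveat: early Athena releases silently ignored skip.header.line.count and support was only added later, so whether this works can depend on the engine version in use:

    CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
      `event_type_id` string,
      `customer_id` string,
      `date` string,
      `email` string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ("separatorChar" = "|", "quoteChar" = "\"")
    LOCATION 's3://location/'
    TBLPROPERTIES ("skip.header.line.count" = "1");  -- skip one header row per file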

How to handle TIMESTAMP_MICROS parquet fields in Presto/Athena

Submitted by 柔情痞子 on 2019-12-06 16:00:44
Presently, we have a DMS task that takes the contents of a MySQL DB and dumps files to S3 in Parquet format. The timestamps in the Parquet files end up as TIMESTAMP_MICROS. This is a problem because Presto (the query engine underlying Athena) does not support microsecond-precision timestamps and assumes all timestamps are in millisecond precision. This does not cause any errors directly, but it makes the times display as some extreme future date, since the number of microseconds is interpreted as a number of milliseconds. We are currently working around this by
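One common workaround (a sketch, not necessarily the one the poster settled on): declare the column as a plain bigint in the table definition and convert the microsecond count in the query itself, e.g. for a hypothetical created_at_micros column:

    SELECT from_unixtime(created_at_micros / 1e6) AS created_at  -- 1e6 microseconds per second
    FROM my_table;   -- hypothetical table and column names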

Alternatives for Athena to query the data on S3

Submitted by 馋奶兔 on 2019-12-06 13:57:25
Question: I have around 300 GB of data on S3. Let's say the data looks like:

## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv

We are doing some complex aggregation on the data, and because some countries' data is big and some countries' data is small, AWS EMR doesn't make sense to use, as once the small countries are finished, the
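Before switching engines, note that Athena usually stays viable at this scale if the table is partitioned so each query scans only the prefixes it needs. A sketch with hypothetical payload columns; because the layout is not Hive-style (no country=Germany folders), each partition has to be registered explicitly:

    CREATE EXTERNAL TABLE countries_data (
      id bigint,        -- hypothetical payload columns
      value string
    )
    PARTITIONED BY (country string, month string, day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://Countries/';

    ALTER TABLE countries_data ADD PARTITION (country = 'Germany', month = '06', day = '01')
    LOCATION 's3://Countries/Germany/06/01/';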

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

Submitted by 余生颓废 on 2019-12-06 06:23:51
I'm getting an error when running an Athena query against a Glue table created from an RDS database: HIVE_UNKNOWN_ERROR: Unable to create input format. The tables are created using a crawler and show up correctly in the Glue interface. However, they do not show up in the Athena interface under the database; it says: "The selected database has no tables". I do not see this behaviour when using a database created from an S3 file, so maybe that is related to the error. Does anybody have an idea? I had the same problem. This is the answer that I got from AWS Support: I understand that you

How to pivot rows into columns in AWS Athena?

Submitted by 試著忘記壹切 on 2019-12-06 06:11:31
Question: I'm new to AWS Athena and trying to pivot some rows into columns, similar to the top answer in this StackOverflow post. However, when I tried:

    SELECT column1, column2, column3
    FROM data
    PIVOT (
      MIN(column3)
      FOR column2 IN ('VALUE1', 'VALUE2', 'VALUE3', 'VALUE4')
    )

I get the error: mismatched input '(' expecting {',', ')'} (service: amazonathena; status code: 400; error code: invalidrequestexception). Does anyone know how to accomplish what I am trying to achieve in AWS Athena? Answer 1: Extending
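The PIVOT keyword is not part of Presto/Athena SQL, which is why the parser trips on the '('. The standard substitute is conditional aggregation; a sketch using the column names and values from the question:

    SELECT column1,
           min(CASE WHEN column2 = 'VALUE1' THEN column3 END) AS value1,
           min(CASE WHEN column2 = 'VALUE2' THEN column3 END) AS value2,
           min(CASE WHEN column2 = 'VALUE3' THEN column3 END) AS value3,
           min(CASE WHEN column2 = 'VALUE4' THEN column3 END) AS value4
    FROM data
    GROUP BY column1;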

AWS Athena (Presto) OFFSET support

Submitted by 寵の児 on 2019-12-06 03:56:02
I would like to know if there is support for OFFSET in AWS Athena. In MySQL the following query runs, but in Athena it gives me an error. Any example would be helpful.

    select * from employee where empSal > 3000 LIMIT 300 OFFSET 20

Athena is basically managed Presto. Since Presto 311 you can use the OFFSET m LIMIT n syntax or its ANSI SQL equivalent: OFFSET m ROWS FETCH NEXT n ROWS ONLY. For older versions (and this includes AWS Athena as of this writing), you can use the row_number() window function to implement OFFSET + LIMIT. For example, instead of SELECT * FROM elb_logs OFFSET 5 LIMIT 5 --
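A sketch of that row_number() workaround, fetching rows 6-10 of elb_logs (the equivalent of OFFSET 5 LIMIT 5). The request_timestamp ordering column is an assumption; without some ORDER BY, the page boundaries are nondeterministic, just as with a bare OFFSET:

    SELECT *
    FROM (
      SELECT *, row_number() OVER (ORDER BY request_timestamp) AS rn  -- assumed column
      FROM elb_logs
    )
    WHERE rn BETWEEN 6 AND 10;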