amazon-athena

AWS Glue Crawler cannot parse large files (classification UNKNOWN)

Posted by ε祈祈猫儿з on 2020-07-10 10:27:29
Question: I've been trying to use an AWS Glue crawler to obtain the columns and other features of a certain JSON file. I parsed the JSON file locally, converted it to UTF-8, moved it into an S3 bucket with boto3, and pointed the crawler at that bucket. I created a JSON classifier with the custom JSON path $[*] and created a crawler with default settings. When I do this with a relatively small file (<50 KB), the crawler correctly identifies the columns

Querying S3 using Athena

Posted by 穿精又带淫゛_ on 2020-07-10 07:40:12
Question: I have a setup with Kinesis Firehose ingesting data, an AWS Lambda function performing data transformation, and the transformed data landing in an S3 bucket. The S3 structure is organized as year/month/day/hour/messages.json, so all of the actual JSON files I am querying sit at the 'hour' level, while the year, month, and day directories contain only subdirectories. My problem is that I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all
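A common approach for this layout is to register year/month/day/hour as partition columns, so a single WHERE clause covers a whole day. The sketch below is illustrative, not the answer from the thread; the bucket, table name, and SerDe are assumptions:

```sql
-- Sketch: table partitioned by the date parts of the S3 path (names illustrative).
CREATE EXTERNAL TABLE IF NOT EXISTS firehose_messages (
  message string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/data/';

-- Register one hour's prefix as a partition (repeat per hour, or script it).
ALTER TABLE firehose_messages
  ADD PARTITION (year = '2020', month = '07', day = '10', hour = '00')
  LOCATION 's3://my-bucket/data/2020/07/10/00/';

-- Query a full day without naming each hour directory.
SELECT *
FROM firehose_messages
WHERE year = '2020' AND month = '07' AND day = '10';
```

Once the partitions are registered, Athena only scans the prefixes that match the predicate, so a day-level query reads exactly that day's hourly files.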

AWS Athena query returns results in incorrect format when query is run again

Posted by 可紊 on 2020-07-09 04:30:09
Question: The first time I ran the query, it returned 2 rows with column names. I edited the table, added skip.header.line.count = 1, and re-ran it (first time), but it returned the same result with double quotes. Then I ran it again (second time), and this changed everything. First-time query output: https://i.stack.imgur.com/k6T2O.png Second-time query output: https://i.stack.imgur.com/6Cxrf.png Answer 1: The problem is that output files from Amazon Athena are being mixed in with your source files.
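For reference, the header-skip property can be set on an existing table without recreating it. This is a minimal sketch (the table name is illustrative, and it assumes the source files and the Athena query-result location live in separate S3 prefixes, per the answer above):

```sql
-- Sketch: skip the first line (the CSV header) of each source file.
ALTER TABLE my_table SET TBLPROPERTIES ('skip.header.line.count' = '1');
```

Note the property is set with `=`, and both the key and the value are quoted strings.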

AWS Athena too slow for an API?

Posted by 让人想犯罪 __ on 2020-07-07 05:37:20
Question: The plan was to get data from AWS Data Exchange, move it to an S3 bucket, then query it with AWS Athena to back a data API. Everything works, it just feels a bit slow. Regardless of the dataset or the query, I can't get Athena's response time below 2 seconds, which is a lot for an API. I checked the best practices, but those also seem to be above 2 seconds. So my question: is 2 seconds the minimum response time for Athena? If so, I'll have to switch to Postgres. Answer 1: Athena is indeed not a low latency data

Athena Query for Array Column

Posted by 痴心易碎 on 2020-06-29 03:33:13
Question: I need your help querying an array column in Athena. Presently I have a table as shown below: 1 2020-05-06 01:13:48 dv1 [{addedtitle=apple, addedvalue=null, keytitle=Increase apple, key=p9, recvalue=0.899999999, unit=lbs, isbalanced=null}, {addedtitle=Orange (12%), addedvalue=15.0, keytitle=Increase Orange, key=p8, recvalue=18.218999999999998, unit=fl oz, isbalanced=null}, {addedtitle=Lemon, addedvalue=32.0, keytitle=Increase Lemon, key=p10, recvalue=33.6, unit=oz, isbalanced=null}
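An array-of-structs column like this is typically flattened with CROSS JOIN UNNEST, which emits one row per array element. The table and column names below are assumptions for illustration, not taken from the truncated question:

```sql
-- Sketch: one output row per struct in the array column (names illustrative).
SELECT t.id,
       item.addedtitle,
       item.addedvalue,
       item.recvalue,
       item.unit
FROM my_table t
CROSS JOIN UNNEST(t.details) AS u(item);
```

After unnesting, the struct fields can be filtered and aggregated like ordinary columns, e.g. `WHERE item.unit = 'lbs'`.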

How to update Athena output location using Cloudformation

Posted by 爱⌒轻易说出口 on 2020-06-28 09:02:41
Question: Can someone help me write a CloudFormation script to update the output location of the Athena primary workgroup? When I run the code below, I get the error "Invalid request provided: primary workGroup could not be created (Service: Athena, Status Code: 400, Request ID: 9945209c-6999-4e8b-bd3d-a3af13b4ac4f)".

Resources:
  MyAthenaWorkGroup:
    Type: AWS::Athena::WorkGroup
    Properties:
      Name: primary
      Description: My WorkGroup Updated
      State: DISABLED
      WorkGroupConfigurationUpdates:
        BytesScannedCutoffPerQuery:

How to handle embed line breaks in AWS Athena

Posted by 微笑、不失礼 on 2020-06-25 10:03:17
Question: I have created a table in AWS Athena like this:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string,
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'

In the bucket I put a simple CSV file with the following content:

rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3 col2"

When I run data preview

Unnesting in SQL (Athena): How to convert array of structs into an array of values plucked from the structs?

Posted by 余生长醉 on 2020-06-25 08:37:55
Question: I am taking samples from a Bayesian statistical model, serializing them with Avro, uploading them to S3, and querying them with Athena. I need help writing a query that unnests an array in the table. The CREATE TABLE query looks like:

CREATE EXTERNAL TABLE `model_posterior` (
  `job_id` bigint,
  `model_id` bigint,
  `parents` array<struct<`feature_name`:string, `feature_value`:bigint, `is_zid`:boolean>>,
  `posterior_samples` struct<`parameter`:string, `is_scaled`:boolean, `samples`:array<double>>
)

The
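To turn an array of structs into an array of values plucked from one field, Athena's Presto engine offers the `transform` higher-order function, which applies a lambda to each element. This is a sketch against the schema above, not the answer from the thread:

```sql
-- Sketch: pluck one field out of each struct in the parents array.
SELECT job_id,
       model_id,
       transform(parents, p -> p.feature_name) AS parent_feature_names
FROM model_posterior;
```

Unlike CROSS JOIN UNNEST, `transform` keeps one row per table row, producing an `array<string>` rather than exploding the array into multiple rows.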

AWS Glue crawler needs to create one table from many files with identical schemas

Posted by 北慕城南 on 2020-06-23 06:52:37
Question: We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl all of the CSV files and then query them from a single table in Athena. The CSV files all have the same schema. The problem is that the crawler generates a table for every file instead of one table. The crawler configuration has a checkbox option, "Create a single schema for each S3 path", but this doesn't seem to do anything. Is what I need possible? Thanks. Answer 1: Glue crawlers
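One common workaround (not necessarily what the truncated answer suggests) is to skip the crawler entirely and define a single external table over the common prefix, since Athena reads every file beneath the table's LOCATION. Bucket, prefix, and columns below are illustrative:

```sql
-- Sketch: one table over the whole prefix; all CSVs under LOCATION are read as one table.
CREATE EXTERNAL TABLE IF NOT EXISTS all_csvs (
  col1 string,
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/top-level-folder/';
```

This works because all files share one schema; mixed schemas under the same LOCATION would produce garbage rows rather than an error.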
