I\'m unable to figure out what is wrong with my CTAS query, it breaks the data into smaller files while storing inside a partition even though I haven\'t mentioned any bucke
Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. It looks like it decided to use five workers for your CTAS query, which will result in five files in each partition.
You could try explicitly specifying a bucket size of one, but you might still get multiple files, if I remember correctly.
I was able to overcome the issue by creating a bucketing column month_a
. Below is the code
CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
format = 'AVRO',
external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
partitioned_by=ARRAY['year','month'],
bucketed_by=ARRAY['month_a'],
bucket_count=12
) AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;