Question
I can't figure out what is wrong with my CTAS query: it splits the data into several small files within each partition, even though I haven't specified any bucketing columns. Is there a way to avoid these small files and store a single file per partition? Files smaller than 128 MB cause additional overhead.
CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
format = 'PARQUET',
parquet_compression = 'GZIP',
external_location='s3://mybucket/Athena/tables/parquet/',
partitioned_by=ARRAY['year','month']
)
AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

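Answer 1:
I was able to overcome the issue by creating a bucketing column, month_a. Because month_a has the same value as the partition column month, every row in a given partition hashes to the same bucket. The code is below: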
CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
format = 'AVRO',
external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
partitioned_by=ARRAY['year','month'],
bucketed_by=ARRAY['month_a'],
bucket_count=12
) AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;
Answer 2:
Athena is a distributed system, and it scales the execution of your query by a mechanism you cannot observe or control. It appears to have used five workers for your CTAS query, which results in five files in each partition.
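To check how many files actually ended up in each partition, a query along these lines may help. It is a sketch that assumes the table from the question exists and relies on Athena's "$path" pseudo-column, which holds the S3 object each row was read from; file_count is just an illustrative alias.

```sql
-- Count the distinct S3 objects (files) behind each partition.
-- "$path" is the S3 location of the file a row was read from.
SELECT "year",
       "month",
       count(DISTINCT "$path") AS file_count  -- illustrative alias
FROM sampledb.yellow_trip_data_parquet
GROUP BY "year", "month"
ORDER BY "year", "month";
```

If file_count matches the number of workers you suspect, that supports the explanation above.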
You could try explicitly specifying a bucket count of one (bucket_count = 1), but you might still get multiple files, if I remember correctly.
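A minimal sketch of that suggestion, under a few assumptions: the table name and S3 location are hypothetical (the location must be empty before the query runs), vendorid is an arbitrary choice of a non-partition column to bucket by, and the column list is shortened for brevity. As noted, this may still produce more than one file per partition.

```sql
-- Sketch: CTAS with a single bucket, so each partition targets one file.
CREATE TABLE sampledb.yellow_trip_data_parquet_single  -- hypothetical table name
WITH (
  format = 'PARQUET',
  parquet_compression = 'GZIP',
  external_location = 's3://mybucket/Athena/tables/parquet_single/',  -- hypothetical, must be empty
  partitioned_by = ARRAY['year', 'month'],
  bucketed_by = ARRAY['vendorid'],  -- assumption: any non-partition column
  bucket_count = 1
) AS SELECT
  VendorID,
  tpep_pickup_datetime,
  total_amount,
  -- Add the remaining columns here; partition columns must come last.
  date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
  date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
FROM sampledb.yellow_trip_data_raw;
```

With bucket_count = 1 the choice of bucketing column should not affect how rows are distributed, since every row lands in the same bucket.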
Source: https://stackoverflow.com/questions/54894503/how-to-avoid-aws-athena-ctas-query-creating-small-files