How to avoid AWS Athena CTAS query creating small files?

人盡茶涼 提交于 2019-12-24 01:16:16

问题


I'm unable to figure out what is wrong with my CTAS query, it breaks the data into smaller files while storing inside a partition even though I haven't mentioned any bucketing columns. Is there a way to avoid these small files and store as one single file per partition as files lesser than 128 MB would cause additional overhead?

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
    format = 'PARQUET'
    parquet_compression = 'GZIP',
    external_location='s3://mybucket/Athena/tables/parquet/'
    partitioned_by=ARRAY['year','month']
)
AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y')  AS year,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c')  AS month
FROM sampleDB.yellow_trip_data_raw;


回答1:


I was able to overcome the issue by creating a bucketing column month_a. Below is the code

CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
    format = 'AVRO',
    external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
    partitioned_by=ARRAY['year','month'],
    bucketed_by=ARRAY['month_a'],
    bucket_count=12
) AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;



回答2:


Athena is a distributed system, and it will scale the execution on your query by some unobservable mechanism. It looks like it decided to use five workers for your CTAS query, which will result in five files in each partition.

You could try explicitly specifying a bucket size of one, but you might still get multiple files, if I remember correctly.



来源:https://stackoverflow.com/questions/54894503/how-to-avoid-aws-athena-ctas-query-creating-small-files

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!