How to avoid AWS Athena CTAS query creating small files?

借酒劲吻你 2020-12-19 15:59

I'm unable to figure out what is wrong with my CTAS query: it breaks the data into smaller files while storing it inside a partition, even though I haven't mentioned any bucke

2 Answers
  • 2020-12-19 16:38

    Athena is a distributed system, and it will scale the execution of your query by some unobservable mechanism. It looks like it decided to use five workers for your CTAS query, which results in five files in each partition.

    You could try explicitly specifying a bucket count of one, but, if I remember correctly, you might still get multiple files.
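
    As a rough sketch of that suggestion, assuming the same source table and column names that appear elsewhere in this thread: the target table name, the S3 location, and the choice of `passenger_count` as the bucketing column are hypothetical placeholders, and this is not guaranteed to produce exactly one file per partition.

    ```sql
    -- Sketch only: table name and S3 path below are hypothetical placeholders.
    -- Athena requires bucketed_by whenever bucket_count is set; with a single
    -- bucket, every row lands in the same bucket regardless of the column chosen.
    CREATE TABLE sampledb.yellow_trip_data_single_bucket
    WITH (
        format = 'AVRO',
        external_location = 's3://your-bucket/your-empty-prefix/',
        partitioned_by = ARRAY['year', 'month'],
        bucketed_by = ARRAY['passenger_count'],
        bucket_count = 1
    ) AS SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        passenger_count,
        total_amount,
        -- Partition columns must come last, in the order given in partitioned_by.
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
    FROM sampledb.yellow_trip_data_raw;
    ```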

  • 2020-12-19 16:54

    I was able to overcome the issue by creating a bucketing column, month_a. The code is below:

    CREATE TABLE sampledb.yellow_trip_data_avro
    WITH (
        format = 'AVRO',
        external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
        partitioned_by=ARRAY['year','month'],
        bucketed_by=ARRAY['month_a'],
        bucket_count=12
    ) AS SELECT
        VendorID,
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        passenger_count,
        trip_distance,
        RatecodeID,
        store_and_fwd_flag,
        PULocationID,
        DOLocationID,
        payment_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        improvement_surcharge,
        total_amount,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
    FROM sampleDB.yellow_trip_data_raw;
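
    One way to check how many files each partition actually received is to count distinct values of Athena's `"$path"` pseudo-column. This is a minimal sketch that assumes the table created above; the `file_count` alias is just an illustrative name.

    ```sql
    -- Count the distinct S3 objects backing each (year, month) partition.
    -- "$path" is the Athena pseudo-column holding the source file of each row.
    SELECT
        year,
        month,
        COUNT(DISTINCT "$path") AS file_count
    FROM sampledb.yellow_trip_data_avro
    GROUP BY year, month
    ORDER BY year, month;
    ```

    Running the same query against a table created without bucketing should make the difference in file counts visible.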
    