Hive creates multiple small files in HDFS for each INSERT

轻奢々 2020-12-14 13:19

The following has already been achieved:

  1. Kafka Producer pulling data from Twitter using Spark Streaming.
  2. Kafka Consumer ingesting data into a Hive external table.
3 Answers
  •  情书的邮戳, 2020-12-14 14:05

    [take 2] OK, so you can't properly "stream" data into Hive. But you can add a periodic compaction post-processing job...

    • create your table with 3 partitions, e.g. (role='activeA'), (role='activeB'), (role='archive') (a table-definition sketch follows this list)
    • point your Spark inserts to (role='activeA')
    • at some point, switch to (role='activeB')
    • then dump every record collected in the "A" partition into "archive", hoping that the Hive default config will do a good job of limiting fragmentation

      INSERT INTO TABLE twitter_data PARTITION (role='archive')
      SELECT ... FROM twitter_data WHERE role='activeA';

      TRUNCATE TABLE twitter_data PARTITION (role='activeA');

    • at some point, switch back to "A" etc.
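
    A minimal sketch of what such a partitioned table could look like, assuming the table is called twitter_data and the tweet payload is reduced to two illustrative columns (everything except the role partition column is an assumption; the real column list comes from your Kafka/Spark pipeline):

      -- role is an ordinary partition column used only for the rotation scheme
      CREATE TABLE twitter_data (
        tweet_id   BIGINT,
        tweet_json STRING
      )
      PARTITIONED BY (role STRING)
      STORED AS ORC;

      -- pre-create the partitions used by the rotation
      ALTER TABLE twitter_data ADD PARTITION (role='activeA');
      ALTER TABLE twitter_data ADD PARTITION (role='activeB');
      ALTER TABLE twitter_data ADD PARTITION (role='archive');

    Note that the sketch uses a managed table: classic Hive only allows TRUNCATE TABLE ... PARTITION on managed tables, which is what lets the compaction job above empty the "A" partition after archiving it.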

    One last word: if Hive still creates too many files on each compaction job, try tweaking a few session parameters just before the INSERT, e.g.

    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.smallfiles.avgsize=1024000000;
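
    Once a compaction pass has run, you can sanity-check the result by listing the files left in the archive partition from the Hive CLI with the dfs command. The path below is the default warehouse location and is only an assumption; use your table's actual LOCATION:

      dfs -ls /user/hive/warehouse/twitter_data/role=archive;

    If the listing still shows hundreds of small files after the merge settings above, hive.merge.smallfiles.avgsize is the first knob to revisit.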
    
