Hive creates multiple small files in HDFS for each INSERT

轻奢々 2020-12-14 13:19

The following has already been achieved:

  1. Kafka Producer pulling data from Twitter using Spark Streaming.
  2. Kafka Consumer ingesting data into a Hive external table.
3 Answers
  •  情书的邮戳, 2020-12-14 14:05

    [take 2] OK, so you can't properly "stream" data into Hive. But you can add a periodic compaction post-processing job...

    • create your table with 3 partitions, e.g. (role='activeA'), (role='activeB'), (role='archive') (a table-definition sketch follows this list)
    • point your Spark inserts to (role='activeA')
    • at some point, switch to (role='activeB')
    • then dump every record collected in the "A" partition into "archive", hoping that the Hive default config will do a good job of limiting fragmentation

      INSERT INTO TABLE twitter_data PARTITION (role='archive')
      SELECT ... FROM twitter_data WHERE role='activeA';

      TRUNCATE TABLE twitter_data PARTITION (role='activeA');

    • at some point, switch back to "A" etc.
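
    A minimal sketch of what such a partitioned table could look like, assuming the table is called twitter_data and the tweet payload is reduced to two illustrative columns (everything except the role partition column is an assumption; the real column list comes from your Kafka/Spark pipeline):

      -- role is an ordinary partition column used only for the rotation scheme
      CREATE TABLE twitter_data (
        tweet_id   BIGINT,
        tweet_json STRING
      )
      PARTITIONED BY (role STRING)
      STORED AS ORC;

      -- pre-create the partitions used by the rotation
      ALTER TABLE twitter_data ADD PARTITION (role='activeA');
      ALTER TABLE twitter_data ADD PARTITION (role='activeB');
      ALTER TABLE twitter_data ADD PARTITION (role='archive');

    Note that the sketch uses a managed table: classic Hive only allows TRUNCATE TABLE ... PARTITION on managed tables, which is what lets the compaction job above empty the "A" partition after archiving it.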

    One last word: if Hive still creates too many files on each compaction job, try tweaking a few session parameters just before the INSERT, e.g.

    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.smallfiles.avgsize=1024000000;
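
    Once a compaction pass has run, you can sanity-check the result by listing the files left in the archive partition from the Hive CLI with the dfs command. The path below is the default warehouse location and is only an assumption; use your table's actual LOCATION:

      dfs -ls /user/hive/warehouse/twitter_data/role=archive;

    If the listing still shows hundreds of small files after the merge settings above, hive.merge.smallfiles.avgsize is the first knob to revisit.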
    
