Spark dataframe write method writing many small files

轮回少年 2020-11-27 17:34

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB-128MB files; our block size is 128MB), which is approx 12…

6 Answers
  •  醉酒成梦
    2020-11-27 18:31

    How about running a Hadoop Streaming job like this to consolidate all the parquet output files into one:

    # -Dmapred.reduce.tasks=1 forces a single reducer, so the job produces a
    # single output file; mapper and reducer are both `cat` (identity), so the
    # input records are simply passed through and merged.
    $ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
     -Dmapred.reduce.tasks=1 \
     -input "/hdfs/input/dir" \
     -output "/hdfs/output/dir" \
     -mapper cat \
     -reducer cat
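
    A note for later readers: the number of files Spark writes is simply the number of partitions of the DataFrame at write time, so the small-file problem can also be tackled inside Spark itself by coalescing before the write. A minimal sketch in Scala (the paths and the target partition count of 16 are hypothetical, not taken from the question):

    import org.apache.spark.sql.SparkSession

    object CoalesceBeforeWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("coalesce-before-write")
          .getOrCreate()

        // Hypothetical input path; point this at the real log/parquet source.
        val df = spark.read.parquet("/hdfs/input/dir")

        // coalesce(n) reduces the number of partitions without a full shuffle,
        // so the write below produces at most n output files. repartition(n)
        // also works and balances file sizes more evenly, at the cost of a shuffle.
        df.coalesce(16)
          .write
          .mode("overwrite")
          .parquet("/hdfs/output/dir-consolidated")

        spark.stop()
      }
    }

    Picking n so that each output file lands near the 128MB block size keeps downstream reads efficient.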
    
