This solution uses a shell script and is not parallelized, but it is still very fast, especially on SSDs. It relies on cat
and output redirection on Unix systems. Suppose the directory containing the CSV partitions is /my/csv/dir
and the output file is /my/csv/output.csv
:
#!/bin/bash
# Write the header once, then append every partition to the output file.
echo "col1,col2,col3" > /my/csv/output.csv
for i in /my/csv/dir/*.csv ; do
    echo "Processing $i"
    cat "$i" >> /my/csv/output.csv
    rm "$i"
done
echo "Done"
The script removes each partition after appending it to the final CSV in order to free disk space.
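A quick way to sanity-check the approach is to run the same loop on throwaway data; all paths below are hypothetical stand-ins for /my/csv/dir and /my/csv/output.csv:

```shell
# Create a scratch directory with two fake partition files.
dir=$(mktemp -d)
mkdir "$dir/parts"
printf '1,a,x\n2,b,y\n' > "$dir/parts/part-00000.csv"
printf '3,c,z\n'        > "$dir/parts/part-00001.csv"

# Same logic as the script above: header once, then append and delete.
out="$dir/output.csv"
echo "col1,col2,col3" > "$out"
for i in "$dir"/parts/*.csv ; do
    cat "$i" >> "$out"
    rm "$i"
done

cat "$out"
# col1,col2,col3
# 1,a,x
# 2,b,y
# 3,c,z
```

The glob expands in sorted order, so rows appear in partition order (part-00000 before part-00001).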
"col1,col2,col3"
is the CSV header (here we have three columns named col1
, col2
and col3
). You must tell Spark not to write the header into each partition (this is accomplished with .option("header", "false")
), because the shell script adds it.
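If you cannot disable the header on the Spark side, a variant of the loop (my sketch, not part of the original script) skips each partition's first line with tail -n +2 instead; the scratch paths here are again hypothetical:

```shell
# Partitions written WITH a header line in each file.
dir=$(mktemp -d)
printf 'col1,col2,col3\n1,a,x\n' > "$dir/part-00000.csv"
printf 'col1,col2,col3\n2,b,y\n' > "$dir/part-00001.csv"

out="$dir/output.csv"
echo "col1,col2,col3" > "$out"
for i in "$dir"/part-*.csv ; do
    tail -n +2 "$i" >> "$out"   # drop the per-partition header line
done

cat "$out"
# col1,col2,col3
# 1,a,x
# 2,b,y
```

This costs a little more than plain cat, since tail must scan for the first newline in each file, but it keeps the merged CSV to a single header row.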