Context
Spark 2.0.1, spark-submit in cluster mode. I am reading a Parquet file from HDFS:
val spark = SparkSession.builder
  .appName("...")  // app name truncated in the original post
  .getOrCreate()
val df = spark.read.parquet("hdfs://...")  // hypothetical path to the Parquet file
These questions popped into my mind as well when I saw so many files, so I searched and found this:
"Unlike bucketing in Apache Hive, Spark SQL creates the bucket files per the number of buckets and partitions. In other words, the number of bucketing files is the number of buckets multiplied by the number of task writers (one per partition)."
Source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html
I think this answers your question no. 1, about why you see this number of files: roughly, files = buckets × writer tasks. For example, 4 buckets × 200 writer tasks (the default spark.sql.shuffle.partitions after a shuffle) could mean up to 800 files.
As for your question no. 2: if we control the number of partitions with repartition before the write (resources permitting), we can limit the number of files created, as sketched below.
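Here is a minimal sketch of that idea. The input path, app name, table name, and bucket column are all hypothetical, only to illustrate the files = buckets × writer-tasks arithmetic:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("bucketed-write-sketch")  // hypothetical app name
  .getOrCreate()

// Hypothetical input path and bucket column.
val df = spark.read.parquet("hdfs:///data/events")

// Pin the number of writer tasks before the bucketed write:
// 10 tasks × 4 buckets = at most 40 output files
// (a task writes fewer if it holds no rows for some bucket).
df.repartition(10)
  .write
  .bucketBy(4, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")  // bucketBy requires saveAsTable, not save()

Without the repartition, every input partition acts as a writer task, so a wide input fans out into buckets multiplied by partitions files, which is exactly the behavior the quote above describes.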