How to control partition size in Spark SQL
I have a requirement to load data from a Hive table using Spark SQL HiveContext and write it to HDFS. By default, the DataFrame produced by the SQL query has only 2 partitions. To get more parallelism I need more partitions out of the SQL. There is no overloaded method in HiveContext that takes a number-of-partitions parameter, and repartitioning the RDD causes shuffling and results in more processing time.

    val result = sqlContext.sql("select * from bt_st_ent")

has the log output:

    Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
    Starting task 1.0 in stage 131.0

How can I increase the number of partitions of the SQL output without the cost of a full repartition?
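For reference, here is a minimal sketch of what I am doing now, including the repartition workaround I would like to avoid. The output path, output format, and the partition count 50 are just placeholders for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object PartitionSizeDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-size-demo"))
        val sqlContext = new HiveContext(sc)

        // The query result comes back with only 2 partitions by default
        val result = sqlContext.sql("select * from bt_st_ent")
        println(result.rdd.partitions.length)   // prints 2 in my case

        // Current workaround: explicitly repartition before writing to HDFS.
        // This gives more parallelism but triggers a full shuffle of the data.
        val repartitioned = result.repartition(50)            // 50 is a placeholder
        repartitioned.write.format("orc").save("/user/output/bt_st_ent")  // placeholder path
      }
    }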