Question
How to specify Parquet Block Size and Page Size in PySpark? I have searched everywhere but cannot find any documentation for the function calls or the import libraries.
Answer 1:
According to the spark-user mailing list archives, in Scala you set these through the Hadoop configuration on the SparkContext:
sc.hadoopConfiguration.setInt("dfs.blocksize", some_value)
sc.hadoopConfiguration.setInt("parquet.block.size", some_value)
so the equivalent in PySpark is
sc._jsc.hadoopConfiguration().setInt("dfs.blocksize", some_value)
sc._jsc.hadoopConfiguration().setInt("parquet.block.size", some_value)
Source: https://stackoverflow.com/questions/38236892/specify-parquet-properties-pyspark