How to force a certain partitioning in a PySpark DataFrame?

Submitted by 微笑、不失礼 on 2019-12-04 13:17:13

Question


Suppose I have a DataFrame with a column partition_id:

n_partitions = 2

df = spark.sparkContext.parallelize([
    [1, 'A'],
    [1, 'B'],
    [2, 'A'],
    [2, 'C']
]).toDF(('partition_id', 'val'))

How can I repartition the DataFrame to guarantee that each value of partition_id goes to a separate partition, and that there are exactly as many actual partitions as there are distinct values of partition_id?

If I do a hash partition, i.e. df.repartition(n_partitions, 'partition_id'), that guarantees the right number of partitions, but some partitions may be empty and others may contain multiple values of partition_id due to hash collisions.
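The collision problem can be seen with a pure-Python toy model. For Python ints, Spark's `portable_hash` is the integer value itself, so the target partition is effectively `key % n_partitions` (that simplification is an assumption of this sketch, not Spark code):

```python
# Toy model of hash partitioning: two distinct keys, two partitions.
n_partitions = 2
keys = [1, 3]  # two distinct partition_id values

# In CPython, hash(small_int) == small_int, so both keys hash to
# an odd number and land in the same partition.
assignments = {k: hash(k) % n_partitions for k in keys}
print(assignments)  # {1: 1, 3: 1}

# Partition 1 holds both keys; partition 0 is empty.
```

So even with the "right" number of partitions, nothing stops two keys from colliding, and the emptied slots are exactly the question's complaint.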


Answer 1:


There is no such option in the Python DataFrame API. The partitioning API for Dataset is not pluggable and supports only the predefined range and hash partitioning schemes.

You can convert the data to an RDD, partition it with a custom partitioner, and convert the result back to a DataFrame:

from pyspark.sql.functions import struct, spark_partition_id

# Build a dense mapping: distinct partition_id -> 0..n-1
mapping = {k: i for i, k in enumerate(
    df.select("partition_id").distinct().rdd.flatMap(lambda x: x).collect()
)}

result = (df
    .select("partition_id", struct([c for c in df.columns]))  # (key, full row) pairs
    .rdd.partitionBy(len(mapping), lambda k: mapping[k])      # one partition per key
    .values()                                                 # drop the key, keep the row
    .toDF(df.schema))

result.withColumn("actual_partition_id", spark_partition_id()).show()
# +------------+---+-------------------+
# |partition_id|val|actual_partition_id|
# +------------+---+-------------------+
# |           1|  A|                  0|
# |           1|  B|                  0|
# |           2|  A|                  1|
# |           2|  C|                  1|
# +------------+---+-------------------+

Please remember that this only creates a specific physical distribution of the data; it does not set a partitioner that the Catalyst optimizer can use.
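The guarantee rests on the `mapping` dict being a bijection from the distinct keys onto `0..n-1`. A minimal pure-Python stand-in for the partitioner above (the plain lists here are hypothetical stand-ins for RDD partitions) makes that visible:

```python
# Stand-in for the mapping-based partitioner: enumerate distinct keys,
# then route each record by mapping[key].
rows = [(1, 'A'), (1, 'B'), (2, 'A'), (2, 'C')]

# Dense mapping: distinct key -> 0..n-1 (sorted only for determinism here)
mapping = {k: i for i, k in enumerate(sorted({k for k, _ in rows}))}

# Route every record to its own partition slot.
partitions = [[] for _ in mapping]
for key, val in rows:
    partitions[mapping[key]].append((key, val))

print(partitions)  # [[(1, 'A'), (1, 'B')], [(2, 'A'), (2, 'C')]]
```

Because `mapping[key]` is unique per key and covers every partition index exactly once, no partition can be empty and no partition can mix two keys, which is exactly what hash partitioning could not promise.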



Source: https://stackoverflow.com/questions/50757050/how-to-force-a-certain-partitioning-in-a-pyspark-dataframe
