Spark deduplication of RDD to get bigger RDD

Submitted by 房东的猫 on 2019-12-12 05:57:20

Question


I have a dataframe loaded from disk

df_ = sqlContext.read.json("/Users/spark_stats/test.json")

It contains 500k rows.
My script works fine at this size, but I want to test it on, say, 5M rows. Is there a way to duplicate the df 9 times? (It does not matter to me that the df contains duplicate rows.)

I already tried union, but it is really too slow (I think because it keeps reading from disk every time):

df = df_
for i in range(9): 
    df = df.union(df_)
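A minimal variant I could imagine, assuming the repeated disk reads really are the problem, would be to cache the source DataFrame once before the loop (cache() and count() are standard PySpark calls; whether this actually removes the slowdown is just my assumption):

# Sketch: cache the source once so the unions reuse the in-memory copy
# (assumes the slowdown comes from re-reading test.json on each union)
df_ = sqlContext.read.json("/Users/spark_stats/test.json")
df_.cache()          # keep df_ in memory after the first materialization
df_.count()          # force the read so the cache is populated

df = df_
for i in range(9):   # 9 extra copies -> 10x the rows in total
    df = df.union(df_)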

Do you have an idea for a clean way to do this?

Thanks


Answer 1:


You can use explode. It should only read from the raw disk once:

from pyspark.sql.types import StringType, StructField, StructType
from pyspark.sql.functions import array, explode, lit

schema = StructType([StructField("f1", StringType()), StructField("f2", StringType())])

data = [("a", "b"), ("c", "d")]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)

# Create an array with as many values as times you want to duplicate the rows
# (use xrange instead of range on Python 2)
dups_array = [lit(i) for i in range(9)]
duplicated = df.withColumn("duplicate", array(*dups_array)) \
               .withColumn("duplicate", explode("duplicate")) \
               .drop("duplicate")
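Applied to the DataFrame from the question, a sketch might look like this (the path and the factor of 10 are taken from the question; the final count() is only there as a sanity check on the row count):

from pyspark.sql.functions import array, explode, lit

df_ = sqlContext.read.json("/Users/spark_stats/test.json")

# 10 array entries -> each original row becomes 10 rows (500k -> 5M)
df_big = df_.withColumn("duplicate", array(*[lit(i) for i in range(10)])) \
            .withColumn("duplicate", explode("duplicate")) \
            .drop("duplicate")

print(df_big.count())   # expect 10x the original row count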


Source: https://stackoverflow.com/questions/44417445/spark-deduplication-of-rdd-to-get-bigger-rdd
