Spark duplication of RDD to get a bigger RDD

Question

I have a DataFrame loaded from disk:

```python
df_ = sqlContext.read.json("/Users/spark_stats/test.json")
```

It contains 500k rows. My script works fine at this size, but I want to test it on, say, 5M rows. Is there a way to duplicate the df 9 times? (It does not matter to me if the df contains duplicates.)

I already use union, but it is really too slow (I think it keeps reading from disk every time):

```python
df = df_
for i in range(9):
    df = df.union(df_)
```

Do you have an idea of a clean way to do that?
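
For context, here is a minimal sketch of one way the repeated union might be sped up, assuming Spark 2.x (where `DataFrame.union` exists, as in the snippet above) and the same `sqlContext` and file path: caching the source DataFrame should keep Spark from re-reading the JSON file on every union.

```python
# Sketch only: assumes an existing sqlContext and the path from the question.
from functools import reduce

df_ = sqlContext.read.json("/Users/spark_stats/test.json")

# cache() is lazy, so trigger an action once to actually materialize it in memory;
# subsequent unions then reuse the cached data instead of re-reading the JSON.
df_.cache()
df_.count()

# Union 10 references to the cached DataFrame (the original plus 9 copies).
df = reduce(lambda a, b: a.union(b), [df_] * 10)

print(df.count())  # roughly 5M rows if df_ has ~500k
```

Whether this helps depends on the source DataFrame fitting in memory; the union itself is cheap (it only combines lineage), so the cost is dominated by how the input is produced.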
问题 I have a dataframe loaded from disk df_ = sqlContext.read.json("/Users/spark_stats/test.json") It contains 500k rows. my script works fine on this size, but I want to test it for example on 5M rows, is there a way to duplicate the df 9 times? (it does not matter for me to have duplicates in the df) i already use union but it is really too slow (as I think it keeps reading from the disk everytime) df = df_ for i in range(9): df = df.union(df_) Do you have an idea about a clean way to do that?