Do Spark/Parquet partitions maintain ordering?
**Question:** If I partition a data set, will the rows be in the correct order when I read it back? For example, consider the following PySpark code:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# read a csv
df = sql_context.read.csv(input_filename)

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])

# read back the file
df2 = sql_context.read.parquet(output_path)
```

I am partitioning on a hash of the `customer_id` column.
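For reference, one way to make the read-back order deterministic, whatever ordering guarantees the partitioned write does or does not provide, is to record the original order in an explicit column and sort on it after reading. The sketch below illustrates this; the `row_order` column, the `crc32`-based bucketing (swapped in because Python's built-in `hash()` of strings varies across runs with `PYTHONHASHSEED`), and the file paths are illustrative assumptions, not part of the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; substitute your own.
input_filename = "customers.csv"
output_path = "customers_by_hash.parquet"

# Assumes the CSV has a header row with a customer_id column.
df = spark.read.csv(input_filename, header=True)

# Record the original row order before writing. monotonically_increasing_id()
# assigns ids that increase with the DataFrame's current partition/row order,
# so sorting on this column later recovers that order.
df = df.withColumn("row_order", F.monotonically_increasing_id())

# Deterministic bucketing: crc32 is stable across runs, unlike Python's
# built-in hash() of strings.
df = df.withColumn("hash", F.crc32(F.col("customer_id")) % 4)

df.write.parquet(output_path, partitionBy=["hash"])

# The read makes no ordering promise on its own; sort explicitly to restore
# the original order.
df2 = spark.read.parquet(output_path).orderBy("row_order")
```

This sidesteps the ordering question rather than answering it: instead of relying on how Spark lays out and later enumerates the partition directories, the order is carried in the data itself.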