Write to multiple outputs by key Spark - one Spark job

挽巷 2020-11-22 05:08

How can you write to multiple outputs, depending on the key, using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

10 Answers
  •  情话喂你
    2020-11-22 05:58

    Good news for Python users: in the case where you have multiple columns and want to save all of the other, non-partitioning columns in CSV format, the "text" method from Nick Chammas' suggestion will fail:

    people_df.write.partitionBy("number").text("people") 
    

    The error message is: "AnalysisException: u'Text data source supports only a single column, and you have 2 columns.;'"
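
    If you need to stick with the "text" writer, one possible workaround (not from the original answer; the "people_text" path and the SparkSession-based setup below are just assumptions for the sketch) is to concatenate the non-partitioning columns into a single string column first, since the text data source accepts only one data column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("partitioned-text").getOrCreate()

    people_df = spark.createDataFrame(
        [(1, "2016-12-26", "alice"), (2, "2016-12-26", "charlie")],
        ["number", "date", "name"])

    # concat_ws joins the remaining columns into one comma-separated string,
    # which satisfies the single-column requirement of the text data source;
    # "number" is still available for partitionBy.
    (people_df
        .select("number", F.concat_ws(",", "date", "name").alias("value"))
        .write
        .partitionBy("number")
        .text("people_text"))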

    In Spark 2.0.0 (my test environment is HDP's Spark 2.0.0) the package "com.databricks.spark.csv" is now integrated, and it allows us to save CSV output partitioned by a single column; see the example below:

    people_rdd = sc.parallelize([(1, "2016-12-26", "alice"),
                                 (1, "2016-12-25", "alice"),
                                 (1, "2016-12-25", "tom"),
                                 (1, "2016-12-25", "bob"),
                                 (2, "2016-12-26", "charlie")])
    df = people_rdd.toDF(["number", "date", "name"])

    (df.coalesce(1)
       .write
       .partitionBy("number")
       .mode("overwrite")
       .format("com.databricks.spark.csv")
       .options(header="false")
       .save("people"))
    
    [root@namenode people]# tree
    .
    ├── number=1
    │   └── part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
    ├── number=2
    │   └── part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
    └── _SUCCESS
    
    [root@namenode people]# cat number\=1/part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
    2016-12-26,alice
    2016-12-25,alice
    2016-12-25,tom
    2016-12-25,bob
    [root@namenode people]# cat number\=2/part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
    2016-12-26,charlie
    

    In my Spark 1.6.1 environment the same code didn't throw any error; however, only one file was generated, and it was not partitioned into the two folders.
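
    As a side note, Spark 2.x also ships a built-in csv data source, so the same partitioned layout can presumably be produced without naming the com.databricks.spark.csv package explicitly (a sketch reusing the df and output path from the example above):

    (df.coalesce(1)
       .write
       .partitionBy("number")
       .mode("overwrite")
       .option("header", "false")
       .csv("people"))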

    Hope this helps.
