PySpark and HDFS commands

闹比i 2020-12-13 16:47

I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data from a previous HDFS run. In Pig this can be done using co…

3 Answers
  •  被撕碎了的回忆
    2020-12-13 17:42

    You can delete an HDFS path in PySpark without any third-party dependencies, as follows:

    from pyspark.sql import SparkSession
    # example of preparing a spark session
    spark = SparkSession.builder.appName('abc').getOrCreate()
    sc = spark.sparkContext
    # Prepare a FileSystem manager
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    path = "Your/hdfs/path"
    # use the FileSystem manager to remove the path
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
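
    If you want to guard against the path not existing, the same FileSystem object also exposes an exists check. A minimal sketch, reusing the sc and fs names from the snippet above (the path is still a placeholder):

    # Build the Path object once so it can be both checked and deleted
    hdfs_path = sc._jvm.org.apache.hadoop.fs.Path("Your/hdfs/path")
    if fs.exists(hdfs_path):
        # the second argument True makes the delete recursive
        fs.delete(hdfs_path, True)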
    

    To go one step further, you can wrap the above idea into a helper function that you can reuse across jobs/packages:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('abc').getOrCreate()
    
    def delete_path(spark, path):
        sc = spark.sparkContext
        fs = (sc._jvm.org
              .apache.hadoop
              .fs.FileSystem
              .get(sc._jsc.hadoopConfiguration())
              )
        fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
    
    delete_path(spark, "Your/hdfs/path")
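
    If you prefer not to go through the JVM gateway at all, another option is to shell out to the HDFS command line from the driver. This is only a minimal sketch, assuming the hdfs binary is available on the driver's PATH; the path and the helper name delete_path_cli are placeholders:

    import subprocess

    def delete_path_cli(path):
        # -r: delete recursively, -f: do not fail if the path does not exist
        subprocess.run(["hdfs", "dfs", "-rm", "-r", "-f", path], check=True)

    delete_path_cli("Your/hdfs/path")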
    
