PySpark and HDFS commands

闹比i 2020-12-13 16:47

I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data left over from a previous HDFS run. In Pig this can be done using fs shell commands.

3 Answers
  • 2020-12-13 17:23

    You can execute an arbitrary shell command using, for example, subprocess.call or the sh library, so something like this should work just fine (a variant with an explicit exit-status check is sketched after the snippet):

    import subprocess
    
    some_path = ...
    subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
    

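    If you want the job to fail fast when the cleanup does not succeed, a minimal variant could check the exit status, e.g. with subprocess.check_call and a recursive, forced delete (this assumes the hdfs CLI is on the PATH; some_path is the same placeholder as above):

    import subprocess
    
    some_path = ...
    # -r removes directories recursively, -f suppresses the error for a missing path;
    # check_call raises CalledProcessError on a non-zero exit code
    subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-f", some_path])
    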
    If you use Python 2.x, you can try spotify/snakebite:

    from snakebite.client import Client
    
    host = ...
    port = ...
    client = Client(host, port)
    client.delete(some_path, recurse=True)
    

    hdfs3 is yet another library which can be used to do the same thing:

    from hdfs3 import HDFileSystem
    
    hdfs = HDFileSystem(host=host, port=port)
    hdfs.rm(some_path)
    

    The Apache Arrow Python bindings are the latest option (and they are often already available on a Spark cluster, since they are required for pandas_udf); see also the pyarrow.fs note after the snippet:

    from pyarrow import hdfs
    
    fs = hdfs.connect(host, port)
    fs.delete(some_path, recursive=True)
    
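    In more recent pyarrow releases the legacy pyarrow.hdfs module is deprecated in favour of pyarrow.fs; a rough equivalent sketch (assuming libhdfs and the Hadoop environment are configured, and that some_path points to a directory) looks like this:

    from pyarrow import fs
    
    hdfs = fs.HadoopFileSystem(host, port)
    # delete_dir removes the directory together with its contents
    hdfs.delete_dir(some_path)
    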
  • 2020-12-13 17:34

    From https://diogoalexandrefranco.github.io/interacting-with-hdfs-from-pyspark/, using only PySpark:

    ######
    # Get fs handler from java gateway
    ######
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())
    
    # We can now use the Hadoop FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
    fs.listStatus(Path('/user/hive/warehouse'))
    # or
    fs.delete(Path('some_path'), True)  # second argument True = recursive
    
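    With the same fs handle you can also guard the cleanup so that a missing path is not treated as an error; the target below is purely illustrative:

    target = Path('some_path')
    if fs.exists(target):
        fs.delete(target, True)  # True = recursive
    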

    The other solutions didn't work in my case, but this blog post helped :)

  • 2020-12-13 17:42

    You can delete an HDFS path in PySpark without using third-party dependencies as follows:

    from pyspark.sql import SparkSession
    # example of preparing a spark session
    spark = SparkSession.builder.appName('abc').getOrCreate()
    sc = spark.sparkContext
    # Prepare a FileSystem manager
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration())
          )
    path = "Your/hdfs/path"
    # use the FileSystem manager to remove the path
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
    

    To go one step further, you can wrap the above idea in a helper function that you can reuse across jobs and packages:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('abc').getOrCreate()
    
    def delete_path(spark, path):
        sc = spark.sparkContext
        fs = (sc._jvm.org
              .apache.hadoop
              .fs.FileSystem
              .get(sc._jsc.hadoopConfiguration())
              )
        fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
    
    delete_path(spark, "Your/hdfs/path")
    
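    A typical use is to clear the output of a previous run right before rewriting it; the path and DataFrame below are just illustrative:

    output_path = "Your/hdfs/path"
    delete_path(spark, output_path)
    # with the old data gone, the default "errorifexists" save mode will not complain
    spark.range(10).write.parquet(output_path)
    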