PySpark and HDFS commands

闹比i 2020-12-13 16:47

I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data from a previous HDFS run. In Pig this can be done using commands such as fs -rm.

3 Answers
  •  情深已故
    2020-12-13 17:34

From https://diogoalexandrefranco.github.io/interacting-with-hdfs-from-pyspark/, using only PySpark:

    ######
    # Get a FileSystem handle from the Java gateway
    ######
    URI = sc._gateway.jvm.java.net.URI
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())

    # We can now use the Hadoop FileSystem API
    # (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
    fs.listStatus(Path('/user/hive/warehouse'))
    # or delete a path; the second argument enables a recursive delete
    fs.delete(Path('some_path'), True)
    

    The other solutions didn't work in my case, but this blog post helped :)
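    For the cleanup-at-startup use case from the question, this pattern can be wrapped in a small helper. Below is a minimal sketch (the helper name and the example path are made up for illustration); it resolves the filesystem from fs.defaultFS in the Hadoop configuration instead of hardcoding a host, and checks that the path exists before deleting:

    def delete_path_if_exists(spark_context, path_str):
        # Resolve the FileSystem from fs.defaultFS rather than a hardcoded URI
        jvm = spark_context._gateway.jvm
        conf = spark_context._jsc.hadoopConfiguration()
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
        path = jvm.org.apache.hadoop.fs.Path(path_str)
        if fs.exists(path):
            fs.delete(path, True)  # True = recursive, needed for directories

    # e.g. at the start of the job (hypothetical path):
    delete_path_if_exists(sc, '/user/me/output_from_previous_run')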
