I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data from the previous HDFS run. In Pig this can be done using a command. Is there a similar way to do it in PySpark?
You can execute an arbitrary shell command using, for example, subprocess.call or the sh library, so something like this should work just fine:
import subprocess
some_path = ...
# shell out to the hadoop CLI; -f suppresses the error when the path does not exist
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
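If the previous run left behind a whole directory, you also need the recursive flag; a minimal sketch (the path and the exit-code check are illustrative assumptions):
import subprocess
# -rm -r -f: recursive delete that does not fail when the path is missing
ret = subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", "/tmp/previous_run"])
if ret != 0:
    raise RuntimeError("hadoop fs -rm exited with code {}".format(ret))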
If you use Python 2.x, you can try spotify/snakebite:
from snakebite.client import Client
host = ...
port = ...
client = Client(host, port)
# delete takes a list of paths and returns a lazy generator,
# so consume it to make sure the delete actually executes
list(client.delete([some_path], recurse=True))
hdfs3 is yet another library which can be used to do the same thing:
from hdfs3 import HDFileSystem
hdfs = HDFileSystem(host=host, port=port)
# call rm on the instance, not the class; recursive=True also removes directories
hdfs.rm(some_path, recursive=True)
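If you want the cleanup to be a no-op when there is nothing left over from a previous run, you can guard the call first (a sketch assuming hdfs3's exists method):
if hdfs.exists(some_path):
    hdfs.rm(some_path, recursive=True)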
The Apache Arrow Python bindings are the latest option (and they are often already available on a Spark cluster, as they are required for pandas_udf):
from pyarrow import hdfs
fs = hdfs.connect(host, port)
fs.delete(some_path, recursive=True)
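Note that pyarrow.hdfs has since been deprecated in newer pyarrow releases; the rough equivalent there (a sketch, assuming the pyarrow.fs API) is:
from pyarrow import fs
hdfs = fs.HadoopFileSystem(host, port)
# delete_dir removes a directory tree; use delete_file for a single file
hdfs.delete_dir(some_path)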
From https://diogoalexandrefranco.github.io/interacting-with-hdfs-from-pyspark/, using only PySpark:
######
# Get fs handler from java gateway
######
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
fs = FileSystem.get(URI("hdfs://somehost:8020"), sc._jsc.hadoopConfiguration())
# We can now use the Hadoop FileSystem API (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
fs.listStatus(Path('/user/hive/warehouse'))
# or
fs.delete(Path('some_path'), True)  # the second argument True makes the delete recursive
The other solutions didn't work in my case, but this blog post helped :)
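The same fs handle also lets you check for the path first, so the cleanup is a no-op on the very first run (a sketch using the Hadoop FileSystem API's exists method):
path = Path('some_path')
if fs.exists(path):
    fs.delete(path, True)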
You can delete an HDFS path in PySpark without using third-party dependencies as follows:
from pyspark.sql import SparkSession
# example of preparing a spark session
spark = SparkSession.builder.appName('abc').getOrCreate()
sc = spark.sparkContext
# Prepare a FileSystem manager
fs = (sc._jvm.org
      .apache.hadoop
      .fs.FileSystem
      .get(sc._jsc.hadoopConfiguration()))
path = "Your/hdfs/path"
# use the FileSystem manager to remove the path; the second argument True makes the delete recursive
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
To go one step further, you can wrap the idea above in a helper function that you can reuse across jobs/packages:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
def delete_path(spark, path):
    sc = spark.sparkContext
    fs = (sc._jvm.org
          .apache.hadoop
          .fs.FileSystem
          .get(sc._jsc.hadoopConfiguration()))
    fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)

delete_path(spark, "Your/hdfs/path")
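This addresses the original use case directly: at the start of the job, clear the previous run's output before writing the new one (the output path below is a hypothetical example):
output_path = "/tmp/my_job/output"  # hypothetical output location
delete_path(spark, output_path)  # remove data left over from the previous run
spark.range(10).write.parquet(output_path)  # then write the new output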