Question
We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts between boost-cpp and the pyarrow version 0.10.0 we are running.
We are trying to read a CSV file from HDFS, but dd.read_csv('hdfs:///path/to/file.csv') fails because dask tries to use hdfs3:
ImportError: Can not find the shared library: libhdfs3.so
From the documentation it seems there is an option to use pyarrow instead.
What is the correct syntax/configuration to do so?
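For reference, here is roughly what we were hoping for; a minimal sketch, assuming dask exposes an hdfs_driver config key for choosing the backend (that key name is our assumption, not something we have confirmed for 0.18.2):
import dask
import dask.dataframe as dd

# Assumed config key for selecting the pyarrow HDFS backend;
# verify against your dask version's docs before relying on it.
dask.config.set({'hdfs_driver': 'pyarrow'})

df = dd.read_csv('hdfs:///path/to/file.csv')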
Answer 1:
Try finding the file using locate -l 1 libhdfs.so. In my case, the file is located under /opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib.
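To double-check that the file found this way is actually loadable, a quick sketch (using my MapR path above; adjust to yours):
import ctypes

# ctypes.CDLL raises OSError if the shared library cannot be loaded.
ctypes.CDLL('/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib/libhdfs.so')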
Then, restart your Jupyter server with the environment variable ARROW_LIBHDFS_DIR set to this path. In my case, the command looks like:
ARROW_LIBHDFS_DIR=/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib jupyter lab --port 2250 --no-browser
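Before involving dask at all, you can check that pyarrow itself now finds the library; a small sketch, assuming the same MapR path (the variable must be set before connecting):
import os
import pyarrow as pa

# pyarrow reads ARROW_LIBHDFS_DIR when connecting, so set it first.
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib'

fs = pa.hdfs.connect()  # raises if libhdfs.so cannot be loaded
print(fs.ls('/'))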
Next, when you create the Yarn Cluster, pass this variable as a worker parameter:
from dask_yarn import YarnCluster

# Create a cluster where each worker has two cores and eight GiB of memory
cluster = YarnCluster(
    worker_vcores=2,
    worker_memory='8GiB',
    worker_env={
        # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
        'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    },
)
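For completeness, a minimal sketch of actually reading the file through this cluster (using the placeholder path from the question):
from dask.distributed import Client
import dask.dataframe as dd

# Connect a distributed client to the YarnCluster created above;
# dd.read_csv will then run through pyarrow on the workers.
client = Client(cluster)
df = dd.read_csv('hdfs:///path/to/file.csv')
print(df.head())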
This solved the problem for me.
(Inspired by https://gist.github.com/priancho/357022fbe63fae8b097a563e43dd885b)
Source: https://stackoverflow.com/questions/52205301/reading-csv-file-from-hdfs-using-dask-and-pyarrow