Reading csv file from hdfs using dask and pyarrow


Question


We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts between boost-cpp and the pyarrow version 0.10.0 we are running.
We are trying to read a CSV file from HDFS, but dd.read_csv('hdfs:///path/to/file.csv') fails because it tries to use hdfs3:

ImportError: Can not find the shared library: libhdfs3.so

From the documentation it seems that there is an option to use pyarrow.

What is the correct syntax/configuration to do so?


Answer 1:


Try finding the file using locate -l 1 libhdfs.so. In my case, the file is located under /opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib.

Then, restart your Jupyter server with the environment variable ARROW_LIBHDFS_DIR set to this path. In my case, the command looks like:

ARROW_LIBHDFS_DIR=/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib jupyter lab --port 2250 --no-browser
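
If restarting the Jupyter server is not convenient, it should also work to set the variable from Python before the first HDFS connection is made, since pyarrow looks it up when loading libhdfs; a minimal sketch, reusing the example path above (adjust it to your system):

import os

# Must run before pyarrow first connects to HDFS
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib'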

Next, when you create the Yarn Cluster, pass this variable as a worker parameter:

from dask_yarn import YarnCluster

# Create a cluster whose workers can locate libhdfs.so via ARROW_LIBHDFS_DIR
cluster = YarnCluster(
    worker_env={
        # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
        'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    },
)
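
With the variable set on both the client and the workers, reading the file from the question should now go through pyarrow; a minimal sketch, assuming the cluster object above and the example path from the question:

from dask.distributed import Client
import dask.dataframe as dd

cluster.scale(2)           # ask YARN for two workers
client = Client(cluster)   # route dask computations through the cluster

df = dd.read_csv('hdfs:///path/to/file.csv')
print(df.head())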

This solved the problem for me.

(Inspired by https://gist.github.com/priancho/357022fbe63fae8b097a563e43dd885b)



Source: https://stackoverflow.com/questions/52205301/reading-csv-file-from-hdfs-using-dask-and-pyarrow
