Why do pandas and dask perform better when importing from CSV compared to HDF5?

我只是一个虾纸丫 submitted on 2019-12-03 17:14:47

HDF5 is most efficient when working with numerical data. I'm guessing you are reading a single string column, which is its weak point.

Performance of string data with HDF5 can be dramatically improved by storing your strings as a Categorical, assuming relatively low cardinality (a high number of repeated values).
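As a rough illustration of why the Categorical helps, here is a minimal sketch. The column name `city` and the sample values are made up for the example, and it only compares in-memory size rather than HDF5 read time:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 strings drawn from only three distinct values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.choice(["London", "Paris", "Tokyo"], size=100_000)})

# Convert the low-cardinality string column to a Categorical.
# Each row then stores a small integer code instead of a full string.
df_cat = df.assign(city=df["city"].astype("category"))

string_bytes = df["city"].memory_usage(deep=True)
category_bytes = df_cat["city"].memory_usage(deep=True)
print(f"string column:   {string_bytes:,} bytes")
print(f"category column: {category_bytes:,} bytes")
```

If PyTables is installed, the converted frame can then be written with something like `df_cat.to_hdf("data.h5", key="df", format="table")`. The file name and key there are placeholders.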

It's from a little while back, but there is a good blog post going through exactly these considerations: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

You may also want to look at Parquet. It is similar to HDF5 in that it is a binary format, but it is column-oriented, so a single-column selection like this will likely be faster.

Recently (2016-2017) there has been significant work to implement a fast native Parquet-to-pandas reader, and the next major release of pandas (0.21) will have to_parquet and pd.read_parquet functions built in.
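A minimal sketch of a single-column read with those functions, assuming a Parquet engine (pyarrow or fastparquet) is installed. The helper name `read_single_column`, the column names, and the file name are hypothetical:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with one string column and one numeric column.
df = pd.DataFrame({
    "name": ["alice", "bob", "alice", "carol"],
    "value": [1.0, 2.5, 3.0, 4.5],
})


def read_single_column(frame, path, column):
    """Write `frame` to Parquet at `path`, then read back only `column`."""
    frame.to_parquet(path)
    # `columns=` asks the reader to load just the listed columns.
    return pd.read_parquet(path, columns=[column])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        names = read_single_column(df, os.path.join(tmp, "example.parquet"), "name")
    print(names)
```

Because the format is column-oriented, the reader does not have to load the `value` column to return `name`.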

https://arrow.apache.org/docs/python/parquet.html

https://fastparquet.readthedocs.io/en/latest/

https://matthewrocklin.com/blog//work/2017/06/28/use-parquet
