Fastest way to write HDF5 files with Python?


Given a large (tens of GB) CSV file of mixed text and numbers, what is the fastest way to create an HDF5 file with the same content, while keeping memory usage reasonable?

3 Answers
  •  南笙 (OP) · 2020-12-13 10:31

    I'm not sure if this is the most efficient way (and I've never used it; I'm just pulling together some tools I've used independently), but you could read the CSV file into a numpy recarray using matplotlib's CSV helper methods.

    You can probably find a way to read the CSV file in chunks as well, to avoid loading the whole thing into memory. Then use the recarray (or slices of it) to write the whole thing (or large chunks of it) to the h5py dataset. I'm not exactly sure how h5py handles recarrays, but the documentation indicates that it should be fine.

    Basically, if possible, write big chunks of data at once instead of iterating over individual elements; the sketch below shows one way to do that.
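
    Here is a minimal sketch of that approach. The matplotlib helper mentioned above (matplotlib.mlab.csv2rec) has since been removed from matplotlib, so this sketch parses each chunk with numpy.genfromtxt instead; the file names, chunk size, and column dtype are hypothetical stand-ins for your actual data.

        import itertools

        import h5py
        import numpy as np

        CHUNK_ROWS = 100_000  # hypothetical chunk size; tune to your memory budget

        # Hypothetical dtype matching the CSV's mixed text/number columns.
        row_dtype = np.dtype([("name", "S32"), ("value", "f8"), ("count", "i8")])

        with open("data.csv") as f, h5py.File("data.h5", "w") as h5:
            # A resizable, chunked dataset lets us append one block at a time.
            dset = h5.create_dataset(
                "table", shape=(0,), maxshape=(None,), dtype=row_dtype, chunks=True
            )
            while True:
                lines = list(itertools.islice(f, CHUNK_ROWS))
                if not lines:
                    break
                # Parse this chunk of lines into a structured array
                # (atleast_1d handles the single-row case, which genfromtxt
                # returns as a 0-d array).
                block = np.atleast_1d(
                    np.genfromtxt(lines, delimiter=",", dtype=row_dtype)
                )
                # Append the whole block in a single write, not element by element.
                n = dset.shape[0]
                dset.resize((n + block.shape[0],))
                dset[n:] = block

    This keeps memory bounded by the chunk size while still giving h5py large, contiguous writes.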

    Another possibility for reading the CSV file is just numpy.genfromtxt.

    You can grab the columns you want using the usecols keyword, and read only a specified range of lines by setting the skip_header and skip_footer keywords appropriately.
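
    A hypothetical call, assuming a comma-delimited file with one header row and a two-line footer, might look like this:

        import numpy as np

        # The column indices, header, and footer counts here are made-up
        # examples; adjust them to the layout of your file.
        data = np.genfromtxt(
            "data.csv",
            delimiter=",",
            usecols=(0, 2),   # keep only columns 0 and 2
            skip_header=1,    # skip the header row
            skip_footer=2,    # ignore the last two lines
            dtype=None,       # infer a per-column type (structured array)
            encoding="utf-8",
        )

    Note that genfromtxt loads everything it parses into memory at once, so for a file this large you would still combine these keywords with the chunked writing approach above.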
