Fastest way to write HDF5 files with Python?


Given a large (tens of GB) CSV file of mixed text and numbers, what is the fastest way to create an HDF5 file with the same content, while keeping memory usage reasonable?

3 Answers
  •  南笙 (OP) · 2020-12-13 10:31

    I'm not sure if this is the most efficient way (and I've never used it; I'm just pulling together some tools I've used independently), but you could read the CSV file into a numpy recarray using matplotlib's CSV helper methods.

    You can probably find a way to read the CSV file in chunks as well, to avoid loading the whole thing into memory. Then use the recarray (or slices of it) to write the whole thing (or large chunks of it) to the h5py dataset. I'm not exactly sure how h5py handles recarrays, but the documentation indicates that it should be fine.

    Basically, if possible, write big chunks of data at once instead of iterating over individual elements; the sketch below shows one way to do that.
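
    Here is a minimal sketch of that approach. The matplotlib helper mentioned above (matplotlib.mlab.csv2rec) has since been removed from matplotlib, so this sketch parses each chunk with numpy.genfromtxt instead; the file names, chunk size, and column dtype are hypothetical stand-ins for your actual data.

        import itertools

        import h5py
        import numpy as np

        CHUNK_ROWS = 100_000  # hypothetical chunk size; tune to your memory budget

        # Hypothetical dtype matching the CSV's mixed text/number columns.
        row_dtype = np.dtype([("name", "S32"), ("value", "f8"), ("count", "i8")])

        with open("data.csv") as f, h5py.File("data.h5", "w") as h5:
            # A resizable, chunked dataset lets us append one block at a time.
            dset = h5.create_dataset(
                "table", shape=(0,), maxshape=(None,), dtype=row_dtype, chunks=True
            )
            while True:
                lines = list(itertools.islice(f, CHUNK_ROWS))
                if not lines:
                    break
                # Parse this chunk of lines into a structured array
                # (atleast_1d handles the single-row case, which genfromtxt
                # returns as a 0-d array).
                block = np.atleast_1d(
                    np.genfromtxt(lines, delimiter=",", dtype=row_dtype)
                )
                # Append the whole block in a single write, not element by element.
                n = dset.shape[0]
                dset.resize((n + block.shape[0],))
                dset[n:] = block

    This keeps memory bounded by the chunk size while still giving h5py large, contiguous writes.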

    Another possibility for reading the CSV file is just numpy.genfromtxt.

    You can grab the columns you want using the usecols keyword, and read only a specified range of lines by setting the skip_header and skip_footer keywords appropriately.
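
    A hypothetical call, assuming a comma-delimited file with one header row and a two-line footer, might look like this:

        import numpy as np

        # The column indices, header, and footer counts here are made-up
        # examples; adjust them to the layout of your file.
        data = np.genfromtxt(
            "data.csv",
            delimiter=",",
            usecols=(0, 2),   # keep only columns 0 and 2
            skip_header=1,    # skip the header row
            skip_footer=2,    # ignore the last two lines
            dtype=None,       # infer a per-column type (structured array)
            encoding="utf-8",
        )

    Note that genfromtxt loads everything it parses into memory at once, so for a file this large you would still combine these keywords with the chunked writing approach above.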
