Concatenate a large number of HDF5 files

迷失自我 2020-12-09 13:01

I have about 500 HDF5 files, each of about 1.5 GB.

Each of the files has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of entries. I would like to concatenate all of them into a single file.

3 Answers
  • 2020-12-09 13:18

    I get that answering this earns me a necro badge - but things have improved for me in this area recently.

    In Julia this takes a few seconds.

    1. Create a txt file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots)
    2. In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
    3. Concatenate all the labels: label = [label label$i]
    4. Then just write: h5write(data_file_path, "/label", label)

    The same can be done if you have groups or more complicated HDF5 files (see the sketch below).
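
    A minimal, hypothetical Julia/HDF5.jl sketch of steps 2-4; the list-file name filelist.txt, the output name, and a vector-shaped /label dataset are assumptions for illustration:

    # Hypothetical sketch of the loop described above
    using HDF5

    label = nothing
    for path in eachline("filelist.txt")          # step 2: one HDF5 path per line
        labeli = h5read(strip(path), "/label")
        # step 3: concatenate, here horizontally as in [label label$i]
        global label = label === nothing ? labeli : [label labeli]
    end
    h5write("concatenated.h5", "/label", label)   # step 4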

  • 2020-12-09 13:25

    Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:

    Make a text file in bash listing the files to concatenate:

    ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
    

    Write a Julia script to concatenate the files into one:

    # concatenate_HDF5.jl
    using HDF5
    
    inputfilepath = ARGS[1]
    outputfilepath = ARGS[2]
    
    data = nothing
    for line in eachline(inputfilepath)  # eachline accepts a file path directly
        r = strip(line)                  # trailing newlines are already removed
        println(r)
        datai = h5read(r, "/data")
        # Concatenate on the 4th dimension in this case;
        # `global` is needed because this runs at script top level
        global data = data === nothing ? datai : cat(data, datai; dims=4)
    end
    h5write(outputfilepath, "/data", data)
    

    Then execute the script file above using:

    julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
    
  • 2020-12-09 13:28

    I found that most of the time was being spent resizing the output file, since I was resizing it at every step. So now I first go through all my files and get their lengths (they are variable).

    Then I create the global HDF5 file, setting its total length to the sum of the lengths of all the files.

    Only after this phase do I fill the HDF5 file with the data from all the small files.

    Now it takes about 10 seconds per file, so the whole run should take less than 2 hours, whereas before it was taking much longer.
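
    A minimal Julia/HDF5.jl sketch of this two-pass approach. It assumes, for illustration, a 1-D Float64 dataset at /data and reuses the file names from the previous answer; a compound dataset would need the corresponding element type instead:

    # Hypothetical sketch: measure first, then preallocate the output
    # dataset once and fill it slice by slice.
    using HDF5

    files = readlines("listofHDF5files.txt")

    # Pass 1: read only the length of /data in each file
    lengths = map(files) do f
        h5open(f, "r") do fid
            size(fid["/data"], 1)
        end
    end

    # Create the output dataset once, sized to the total length
    h5open("final_concatenated_HDF5.hdf5", "w") do out
        dset = create_dataset(out, "data", datatype(Float64), dataspace((sum(lengths),)))

        # Pass 2: copy each file's data into its slice of the big dataset
        offset = 0
        for (f, n) in zip(files, lengths)
            dset[offset+1:offset+n] = h5read(f, "/data")
            offset += n
        end
    end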
