Writing/reading large vectors of data to binary file in C++

Submitted by 痴心易碎 on 2021-02-07 20:52:22

Question


I have a C++ program that computes populations within a given radius by reading gridded population data from an ASCII file into a large 8640x3432-element vector of doubles. Reading the ASCII data into the vector takes ~30 seconds (looping over each row and each column), while the rest of the program takes only a few seconds. I was asked to speed this up by writing the population data to a binary file, which would supposedly read in faster.

The ASCII data file has a few header rows giving specs such as the number of columns and rows, followed by the population data for each grid cell, formatted as 3432 rows of 8640 numbers separated by spaces. The numbers come in mixed formats: a plain 0, a decimal value (0.000685648), or a value in scientific notation (2.687768e-05).
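
For illustration, the top of the file looks something like this (the values and the header keyword names here are made up; my code only reads the two numbers following the first two header keywords):

ncols 8640
nrows 3432
0 0 0.000685648 2.687768e-05 0 ...
0 2.687768e-05 0 0.000685648 0 ...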

I found a few examples of reading/writing structs containing vectors to binary, and tried to implement something similar, but I am running into problems. When I both write and read the vector to/from the binary file in the same program run, it appears to work and gives me all the correct values, but then it ends with either a "Segmentation fault: 11" or a memory allocation error saying a "pointer being freed was not allocated". And if I just read the data in from the previously written binary file (without re-writing it in the same run), it gives me the header variables fine but segfaults before printing the vector data.

Any advice on what I might have done wrong, or on a better way to do this, would be greatly appreciated! I am compiling and running on a Mac, and I don't have Boost or other non-standard libraries at present. (Note: I am extremely new at coding and am learning by jumping in at the deep end, so I may be missing a lot of basic concepts and terminology -- sorry!)

Here is the code I came up with:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

//Define a struct for the population file data and declare one instance for the data read from ascii (A) and one for the data read back from binary (B)
struct popFileData
{
    int nRows, nCol;
    vector< vector<double> > popCount; //this will end up having 3432x8640 elements
} popDataA, popDataB;

int main() {

    string gridFname = "sample";

    double dum;
    vector<double> tempVector;

    //open ascii population grid file to stream
    ifstream gridFile;
    gridFile.open(gridFname + ".asc");

    int i = 0, j = 0;

    if (gridFile.is_open())
    {
        //read in header data from file
        string fileLine;
        gridFile >> fileLine >> popDataA.nCol;
        gridFile >> fileLine >> popDataA.nRows;

        popDataA.popCount.clear();

        //read in vector data, point-by-point
        for (i = 0; i < popDataA.nRows; i++)
        {
            tempVector.clear();

            for (j = 0; j<popDataA.nCol; j++)
            {
                gridFile >> dum;
                tempVector.push_back(dum);
            }
            popDataA.popCount.push_back(tempVector);
        }
        //close ascii grid file
        gridFile.close();
    }
    else
    {
        cout << "Population file read failed!" << endl;
    }

    //create/open binary file
    ofstream ofs(gridFname + ".bin", ios::trunc | ios::binary);
    if (ofs.is_open())
    {
        //write struct to binary file then close binary file
        ofs.write((char *)&popDataA, sizeof(popDataA));
        ofs.close();
    }
    else cout << "error writing to binary file" << endl;

    //read data from binary file into popDataB struct
    ifstream ifs(gridFname + ".bin", ios::binary);
    if (ifs.is_open())
    {
        ifs.read((char *)&popDataB, sizeof(popDataB));
        ifs.close();
    }
    else cout << "error reading from binary file" << endl;

    //compare results of reading in from the ascii file and reading in from the binary file
    cout << "File Header Values:\n";
    cout << "Columns (ascii vs binary): " << popDataA.nCol << " vs. " << popDataB.nCol << endl;
    cout << "Rows (ascii vs binary):" << popDataA.nRows << " vs." << popDataB.nRows << endl;

    cout << "Spot Check Vector Values: " << endl;
    cout << "Index 0,0: " << popDataA.popCount[0][0] << " vs. " << popDataB.popCount[0][0] << endl;
    cout << "Index 3431,8639: " << popDataA.popCount[3431][8639] << " vs. " << popDataB.popCount[3431][8639] << endl;
    cout << "Index 1600,4320: " << popDataA.popCount[1600][4320] << " vs. " << popDataB.popCount[1600][4320] << endl;

    return 0;
}

Here is the output when I both write and read the binary file in the same run:

File Header Values:
Columns (ascii vs binary): 8640 vs. 8640
Rows (ascii vs binary):3432 vs.3432
Spot Check Vector Values: 
Index 0,0: 0 vs. 0
Index 3431,8639: 0 vs. 0
Index 1600,4320: 25.2184 vs. 25.2184
a.out(11402,0x7fff77c25310) malloc: *** error for object 0x7fde9821c000: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

And here is the output I get if I just try to read from the pre-existing binary file:

File Header Values:
Columns (binary): 8640
Rows (binary):3432
Spot Check Vector Values: 
Segmentation fault: 11

Thanks in advance for any help!


Answer 1:


When you write popDataA to the file, you are writing the binary representation of the vector of vectors. However this really is quite a small object, consisting of a pointer to the actual data (itself a series of vectors, in this case) and some size information.

When it's read back in to popDataB, it kinda works! But only because the raw pointer that was in popDataA is now in popDataB, and it points to the same stuff in memory. Things go crazy at the end, because when the memory for the vectors is freed, the code tries to free the data referenced by popDataA twice (once for popDataA, and once again for popDataB.)
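
To make that concrete, here is a small standalone sketch (not part of your program) showing that the vector object you dumped is only a few machine words, no matter how much data it holds:

#include <iostream>
#include <vector>

int main() {
    std::vector< std::vector<double> > popCount(3432, std::vector<double>(8640));

    // sizeof sees only the vector "header" (pointer plus size/capacity info),
    // typically 24 bytes on a 64-bit system -- not the ~237 MB of doubles.
    std::cout << "sizeof(popCount) = " << sizeof(popCount) << " bytes\n";
    std::cout << "payload size     = "
              << popCount.size() * popCount[0].size() * sizeof(double)
              << " bytes\n";
    return 0;
}

Those few bytes are mostly a pointer value, which means nothing to a later run of the program that reads them back.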

The short version is, it's not a reasonable thing to write a vector to a file in this fashion.

So what to do? The best approach is to first decide on your data representation. It will, like the ASCII format, specify what value gets written where, and will include information about the matrix size, so that you know how large a vector you will need to allocate when reading them in.

In semi-pseudo code, writing will look something like:

int nrow=...;   // number of rows in the grid
int ncol=...;   // number of columns in the grid

// write the dimensions first, then the values in row-major order
ofs.write((char *)&nrow,sizeof(nrow));
ofs.write((char *)&ncol,sizeof(ncol));
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val=data[i][j];
        ofs.write((char *)&val,sizeof(val));
    }
}

And reading will be the reverse:

ifs.read((char *)&nrow,sizeof(nrow));
ifs.read((char *)&ncol,sizeof(ncol));
// allocate data-structure of size nrow x ncol
// ...
for (int i=0;i<nrow;++i) {
    for (int j=0;j<ncol;++j) {
        double val;
        ifs.read((char *)&val,sizeof(val));
        data[i][j]=val;
    }
}
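
If you want something closer to drop-in code, below is a self-contained sketch of that scheme (the function names and the use of std::int32_t for the dimensions are my own choices, not anything from your program). It writes each row in a single call rather than one double at a time, since each row of a vector<double> is stored contiguously:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Sketch: write the grid as [nrow][ncol][row-major doubles].
bool writeGrid(const std::string& fname,
               const std::vector< std::vector<double> >& data)
{
    std::ofstream ofs(fname, std::ios::binary | std::ios::trunc);
    if (!ofs) return false;

    std::int32_t nrow = static_cast<std::int32_t>(data.size());
    std::int32_t ncol = nrow ? static_cast<std::int32_t>(data[0].size()) : 0;
    ofs.write(reinterpret_cast<const char*>(&nrow), sizeof(nrow));
    ofs.write(reinterpret_cast<const char*>(&ncol), sizeof(ncol));

    // Each row is contiguous in memory, so it can go out in one write call.
    for (const auto& row : data)
        ofs.write(reinterpret_cast<const char*>(row.data()),
                  row.size() * sizeof(double));
    return bool(ofs);
}

// Sketch: read the grid back in the same order it was written.
bool readGrid(const std::string& fname,
              std::vector< std::vector<double> >& data)
{
    std::ifstream ifs(fname, std::ios::binary);
    if (!ifs) return false;

    std::int32_t nrow = 0, ncol = 0;
    ifs.read(reinterpret_cast<char*>(&nrow), sizeof(nrow));
    ifs.read(reinterpret_cast<char*>(&ncol), sizeof(ncol));
    if (!ifs || nrow <= 0 || ncol <= 0) return false;

    data.assign(nrow, std::vector<double>(ncol));
    for (auto& row : data)
        ifs.read(reinterpret_cast<char*>(row.data()),
                 row.size() * sizeof(double));
    return bool(ifs);
}

In your program you would call writeGrid(gridFname + ".bin", popDataA.popCount) once after parsing the ASCII file, and readGrid(gridFname + ".bin", popDataB.popCount) on later runs instead of re-parsing the text.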

All that said though, you should consider not writing things into a binary file like this. These sorts of ad hoc binary formats tend to live on, long past their anticipated utility, and tend to suffer from:

  • Lack of documentation
  • Lack of extensibility
  • Format changes without versioning information
  • Issues when using saved data across different machines, including endianness problems, different default sizes for integers, etc.

Instead, I would strongly recommend using a third-party library. For scientific data, HDF5 and netcdf4 are good choices which address all of the above issues for you, and come with tools that can inspect the data without knowing anything about your particular program.

Lighter-weight options include the Boost serialization library and Google's protocol buffers, but these address only some of the issues listed above.



Source: https://stackoverflow.com/questions/28886899/writing-reading-large-vectors-of-data-to-binary-file-in-c
