What does data serialization do?

Question

I'm having a hard time understanding what serialization is and does.

Let me simplify my problem. I have a struct info in my C/C++ program, and I may store this struct's data in a file save.bin or send it over a socket to another computer.

struct info {
    std::string name;
    int age;
};

void write_to_file()
{
    info a = {"Steve", 10};
    ofstream ofs("save.bin", ofstream::binary);
    ofs.write((char *) &a, sizeof(a));   // am I doing it right?
    ofs.close();
}

void write_to_sock()
{
    // I don't know the socket API, but I assume writing a to a socket is similar to writing it to a file, isn't it?
}

write_to_file will simply save the struct info object a to disk, making this data persistent, right? And writing it to a socket is pretty much the same, right?

In the above code, I don't think I used data serialization, but the data a is made persistent in save.bin anyway, right?

Question

Then what's the point of serialization? Do I need it here? If yes, how should I use it?
I always think that any kind of files, .txt/.csv/.exe/..., are bits of 01 in memory, which means they have binary representation naturally, so can't we simply send these files via socket directly?

A code example would be highly appreciated.

Answer 1:

but the data a is made persistent in save.bin anyway, right?

No! Your struct contains a std::string. Its exact implementation (and therefore the binary data you get from a cast to char*) is not defined by the standard, but the actual string data will generally reside somewhere outside the object itself, heap-allocated, so you can't save that data this easily. With properly done serialisation, the string data is written to the same place as the rest of the object, so you will be able to read it back from a file. That's what you need serialisation for.

How to do it: you have to encode the string in some way. The easiest way is to first write its length, then the string itself. When reading the file back, first read the length, then read that many bytes into a new string object.
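
A minimal sketch of that length-prefix scheme for the info struct from the question. The helper names (write_u32, read_u32, serialize, deserialize, round_trip) and the choice of a 4-byte, most-significant-byte-first length are assumptions made for illustration, and error handling for truncated input is left out:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct info {
    std::string name;
    int age;
};

// Write a 32-bit value as four bytes, most significant first,
// so the output does not depend on the host's byte order.
void write_u32(std::ostream& out, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        out.put(static_cast<char>((v >> shift) & 0xFF));
}

// Read four bytes written by write_u32 (no end-of-file check in this sketch).
std::uint32_t read_u32(std::istream& in)
{
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | static_cast<unsigned char>(in.get());
    return v;
}

void serialize(std::ostream& out, const info& a)
{
    assert(a.name.size() <= 0xFFFFFFFFu);
    write_u32(out, static_cast<std::uint32_t>(a.name.size()));              // length first
    out.write(a.name.data(), static_cast<std::streamsize>(a.name.size()));  // then the characters
    write_u32(out, static_cast<std::uint32_t>(a.age));
}

info deserialize(std::istream& in)
{
    info a;
    std::uint32_t len = read_u32(in);   // read the length back first
    a.name.resize(len);
    in.read(&a.name[0], static_cast<std::streamsize>(len));
    a.age = static_cast<int>(read_u32(in));
    return a;
}

// Round trip through an in-memory stream; a binary std::fstream works the same way.
info round_trip(const info& a)
{
    std::stringstream ss;
    serialize(ss, a);
    return deserialize(ss);
}
```

With a std::ofstream opened in binary mode in place of the string stream, {"Steve", 10} becomes 13 bytes in save.bin: 4 for the length, 5 for the characters, 4 for the age.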

I always think that any kind of files, .txt/.csv/.exe/..., are bits of 01 in memory

Yes, but the problem is that it's not universally defined which bits represent which part of a data structure. In particular, there are little-endian and big-endian architectures, which store the bytes of a multi-byte value in opposite orders. If you naïvely read a file written on an architecture with a different byte order, you will get garbage.
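
To make the mismatch concrete, here is a small sketch that interprets the same four bytes in both orders. The function names are made up for this example; the bytes are what a little-endian machine with a 32-bit int would typically write for the value 10:

```cpp
#include <cassert>
#include <cstdint>

// Interpret four bytes as a 32-bit value, least significant byte first.
std::uint32_t decode_little_endian(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Interpret the same four bytes, most significant byte first.
std::uint32_t decode_big_endian(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}

void endianness_demo()
{
    const unsigned char bytes[4] = {0x0A, 0x00, 0x00, 0x00};
    assert(decode_little_endian(bytes) == 10u);          // what the writer meant
    assert(decode_big_endian(bytes) == 0x0A000000u);     // what a big-endian reader sees
}
```

Decoding with explicit shifts like this gives the same result on any host, which is why serialization formats fix one byte order and stick to it.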

Answer 2:

Just writing down binary in-memory images is a form of serialization and for trivial cases it works. However in general you need to solve a few more problems that just dumping the memory doesn't consider:

1. Pointers

If the data contains any pointer, of course you cannot just dump it and load it later, as the memory addresses the pointers refer to will have no meaning once the program terminates and is restarted. Many objects have "hidden" pointers: for example, there's no way to dump a std::vector from memory and reload it later correctly. sizeof on a std::vector clearly doesn't include the size of the contained elements, and therefore any structure containing a std::vector cannot be just dumped and reloaded. The same goes for std::string and all other std containers.
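
A short sketch of the sizeof point, using a made-up polyline struct: a raw dump writes the same number of bytes whether the vector is empty or holds a thousand elements, so the elements themselves never reach the file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct polyline {
    std::vector<int> points;   // the elements live on the heap, not inside the struct
};

// Number of bytes a raw memory dump of the object would write.
std::size_t dump_size(const polyline&)
{
    return sizeof(polyline);
}

// Number of bytes the elements themselves occupy.
std::size_t payload_size(const polyline& p)
{
    return p.points.size() * sizeof(int);
}

void hidden_pointer_demo()
{
    polyline empty;
    polyline big;
    big.points.assign(1000, 7);
    assert(dump_size(empty) == dump_size(big));        // the dump cannot tell them apart by size
    assert(payload_size(big) == 1000 * sizeof(int));   // this part is never written
}
```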

2. Portability

C and C++ structs and classes are not defined in terms of the bytes they occupy in memory, at least not portably. This means that a different compiler, a different compiler version, or even the same version with different compile options may generate code in which the structure layout in memory is not the same.

If you need serialization just to save and reload the data in the same program, and that data is not supposed to live long, then memory dumping can indeed be used. Just think, however, about having millions of documents saved by dumping structures, and then the new compiler version (which you're forced to use because it's the only one supported on the new OS version) has a different layout and you cannot load those documents any more.

In addition to same-system portability problems note also that even just a single integer can have a different in-memory representation on different systems. It may be larger or smaller; it may have a different byte ordering. Just using a memory dump means that what is saved cannot be loaded by another system. Not even a single integer.

3. Versioning

If the data you save will have a long lifespan then it's quite probable that you'll change the structures as the program evolves, for example you will add new fields, you will remove unused fields, you will change the general structure (e.g. changing a vector to a linked list).

If your format is just the memory image of the current data structures, it will be pretty hard to later add, for example, a color field to a polygon object and still have the program load old documents, using as the default color value the color that was used in the previous version.

Even writing a conversion program will be difficult, because you will have old code able to load old documents and new code able to save new documents, but you cannot just "merge" the two and get a program that loads old and saves new (i.e. the source code of both programs will have a polygon structure but with different fields; now what?).
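
One common way out is to put a version number into the format itself. The sketch below assumes a made-up text format ("version sides [color]") and a hypothetical default color; the polygon fields and all names are illustrative only:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

struct polygon {
    int sides;
    std::uint32_t color;   // field added in format version 2
};

const std::uint32_t kDefaultColor = 0xFFFFFF;  // hypothetical: the color version 1 always used
const int kCurrentVersion = 2;

// Always save in the newest format, tagged with its version.
void save(std::ostream& out, const polygon& p)
{
    out << kCurrentVersion << ' ' << p.sides << ' ' << p.color << '\n';
}

// Load any known version; fields missing from old documents get defaults.
bool load(std::istream& in, polygon& p)
{
    int version = 0;
    if (!(in >> version >> p.sides))
        return false;
    if (version == 1) {
        p.color = kDefaultColor;
        return true;
    }
    if (version == 2)
        return static_cast<bool>(in >> p.color);
    return false;   // written by a newer program than this one
}

void load_old_document_demo()
{
    std::istringstream old_doc("1 5\n");   // a version 1 document: no color stored
    polygon p = {0, 0};
    bool ok = load(old_doc, p);
    assert(ok && p.sides == 5 && p.color == kDefaultColor);
    (void)ok;
}
```

Because the loader understands both versions and the saver only writes the newest, this single program already is the "load old, save new" converter.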

Answer 3:

You're playing a game. On very hard mode. You reach the last level. You're happy. The 2 days of non-stop play are paying off. The plot will soon come to an end. You'll find the evil mastermind's motivation, how you got to be the hero and will collect the sought-after epic artifact that awaits behind that last door. And you got here without having to restart once.

Behind the scenes, there's a game object, that looks like this:

class GameState
{
   int level;
};

And the level is 25.

You really enjoyed the game so far, but you don't want to start all over in case the last boss kills you. So, intuitively, you press Ctrl+S. But wait, you get an error:

Sorry, saving is disabled.

What? So I have to start all over in case I die? How can this be?

Drumroll

The developers, albeit brilliant (they managed to keep you hooked for 2 straight days, right?) didn't implement serialization.

When you restart the game, memory clean-up takes place. That all-important GameState object, the one whose level member you spent 2 days raising to 25, is destroyed.

How could you fix this? The memory is reclaimed by the OS when you close the game. Where could you store it? On an external server? (sockets) On disk? (write to file)

Okay, why not.

class GameState
{
    int level;
    void save(const std::string& fileName)
    { /* write level to file */ }
    void load(const std::string& fileName)
    { /* read game state from file */ }
};

When you press Ctrl+S, the GameState object is saved to a file.

And, miraculously, when you load the game, the GameState object is read from that file. You no longer have to spend 2 days to get back to that last boss. You're already there.
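
For a state this small, the two stubs could be filled in roughly as follows. This is only a sketch: it takes streams instead of a file name so it can be tried without touching the disk, and the one-line text format ("level 25") and the helper level_after_restart are assumptions of this example.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

class GameState
{
public:
    int level = 1;

    // Write the state as one line of text, e.g. "level 25".
    void save(std::ostream& out) const
    {
        out << "level " << level << '\n';
    }

    // Read the state back; returns false and leaves level unchanged on malformed input.
    bool load(std::istream& in)
    {
        std::string key;
        int value = 0;
        if (!(in >> key >> value) || key != "level")
            return false;
        level = value;
        return true;
    }
};

// Simulates "save, quit, restart, load" with an in-memory stream.
int level_after_restart(int level_before_quit)
{
    GameState played;
    played.level = level_before_quit;
    std::stringstream disk;     // stands in for a std::ofstream / std::ifstream pair
    played.save(disk);

    GameState restarted;        // fresh object, level == 1
    bool ok = restarted.load(disk);
    assert(ok);
    (void)ok;
    return restarted.level;
}
```

Passing a std::ofstream to save and, on the next run, a std::ifstream to load gives the on-disk version.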

Real answer:

Technically, writing serialization functionality is pretty difficult. I suggest you use a third-party library. Google Protocol Buffers offers serialization that is cross-platform and even cross-language. Many others exist.

1. Then what's the point of serialization? Do I need it here? If yes, how should I use it?

As explained above, it stores state between runs, or between processes (possibly on different machines). Whether you need it or not depends on whether you need to store state and reload it later.

2. I always think that any kind of files, .txt/.csv/.exe/..., are bits of 01 in memory, which means they have binary representation naturally, so can't we simply send these files via socket directly?

They are. But you don't want to modify the .exe whenever you play a new game.

Answer 4:

Your string won't be saved correctly. Different machines might represent integers differently, and different programming languages won't use the same representation for strings, for instance.

And when your structure has pointers as members, you will save the pointer's address and not the pointed-to data, which means you have no way of getting that data back from the file. What if your structure needs to change? All software that uses your data will need to change.

Yes you can send files via socket, but you will need some kind of protocol in order to make sure you know the name of the file and when you've reached the end of the file.
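
A minimal sketch of such a protocol, kept independent of any socket API: each message is framed as a 4-byte name length, the name, a 4-byte content length, then the content. The struct and function names are invented for this example, and the framing itself is just one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct file_message {
    std::string name;
    std::string content;
};

// Append a 32-bit value, most significant byte first.
void append_u32(std::string& buf, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Read a 32-bit value at pos; false if fewer than four bytes remain (assumes pos <= buf.size()).
bool read_u32(const std::string& buf, std::size_t& pos, std::uint32_t& v)
{
    if (buf.size() - pos < 4)
        return false;
    v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | static_cast<unsigned char>(buf[pos++]);
    return true;
}

// Read one length-prefixed field; false if the field has not fully arrived yet.
bool read_field(const std::string& buf, std::size_t& pos, std::string& out)
{
    std::uint32_t len = 0;
    if (!read_u32(buf, pos, len) || buf.size() - pos < len)
        return false;
    out.assign(buf, pos, len);
    pos += len;
    return true;
}

// Sender side: the bytes to write to the socket.
std::string frame(const file_message& m)
{
    std::string buf;
    append_u32(buf, static_cast<std::uint32_t>(m.name.size()));
    buf += m.name;
    append_u32(buf, static_cast<std::uint32_t>(m.content.size()));
    buf += m.content;
    return buf;
}

// Receiver side: true once buf holds a complete message, false means "wait for more bytes"
// (m may be partially filled in that case).
bool try_parse(const std::string& buf, file_message& m)
{
    std::size_t pos = 0;
    return read_field(buf, pos, m.name) && read_field(buf, pos, m.content);
}

void framing_demo()
{
    file_message in = {"save.bin", "abc"};
    std::string wire = frame(in);
    file_message out;
    assert(!try_parse(wire.substr(0, 10), out));   // only part of the name has arrived
    assert(try_parse(wire, out) && out.name == "save.bin" && out.content == "abc");
}
```

The receiver keeps appending whatever the socket delivers to its buffer and calls try_parse until it returns true; the length prefixes are what tell it where the name stops and where the file ends.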

Answer 5:

Serialization does a lot of things. It supports persistence (being able to leave the program, then come back to it and obtain the same data), and communicating between processes and machines. It basically means converting your internal data to a sequence of bytes, and to be useful, you have to support deserialization as well: converting the sequence of bytes back into data.

When you do this, it's important to realize that internally to the program, data isn't just a sequence of bytes. It has format and structure: how a double is represented is different from one machine to the next, for example; and more complex objects, like std::string, aren't even in contiguous memory. So the first thing you have to do when you serialize is define how each type is represented as a sequence of bytes. If you're communicating with another program, both programs have to agree on this serial format; if it's just so that you can reread the data yourself, you can use any format you want (but I'd recommend using a pre-defined standard format, like XDR, if only to simplify the documentation).

What you cannot do is just dump out an image of the object in memory. Complex objects like std::string will have pointers in them, and these pointers will be meaningless in another process. And even the representation of simple types like double may change over time. (The migration from 32 bits to 64 resulted in the size of a long changing on most systems.) You must define a format, and then generate it byte by byte, from the data you have. To write XDR, for example, you might use something like this:

#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

typedef std::vector<char> Buffer;

void
writeUInt( Buffer& dest, unsigned value )
{
    //  XDR is big-endian: most significant byte first.
    dest.push_back( (value >> 24) & 0xFF );
    dest.push_back( (value >> 16) & 0xFF );
    dest.push_back( (value >>  8) & 0xFF );
    dest.push_back( (value      ) & 0xFF );
}

void
writeInt( Buffer& dest, int value )
{
    writeUInt( dest, static_cast<unsigned>( value ) );
}

void
writeString( Buffer& dest, std::string const& value)
{
    assert( value.size() <= 0xFFFFFFFF );
    //  Length prefix first, then the characters.
    writeUInt( dest, static_cast<unsigned>( value.size() ) );
    std::copy( value.begin(), value.end(), std::back_inserter( dest ) );
    //  XDR pads every item to a multiple of four bytes.
    while ( dest.size() % 4 != 0 ) {
        dest.push_back( '\0' );
    }
}
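
Deserialization mirrors the writers. The readers below are a sketch written to match the functions above, not part of any XDR library; the names readUInt, readInt and readString and the explicit pos cursor are assumptions of this example, and they assert instead of reporting errors:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<char> Buffer;

// Read four bytes, most significant first, and advance pos past them.
unsigned
readUInt( Buffer const& src, std::size_t& pos )
{
    assert( pos <= src.size() && src.size() - pos >= 4 );
    unsigned value = 0;
    for ( int i = 0; i < 4; ++ i ) {
        value = (value << 8) | static_cast<unsigned char>( src[pos ++] );
    }
    return value;
}

int
readInt( Buffer const& src, std::size_t& pos )
{
    return static_cast<int>( readUInt( src, pos ) );
}

// Read the length prefix, then that many characters, then skip the padding.
std::string
readString( Buffer const& src, std::size_t& pos )
{
    unsigned length = readUInt( src, pos );
    assert( src.size() - pos >= length );
    Buffer::const_iterator first = src.begin() + static_cast<std::ptrdiff_t>( pos );
    std::string value( first, first + static_cast<std::ptrdiff_t>( length ) );
    pos += length;
    while ( pos % 4 != 0 ) {
        ++ pos;
    }
    return value;
}
```

Reading the fields in the same order they were written (for the struct in the question: readString for the name, then readInt for the age) reconstructs the object.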

Answer 6:

Aside from big-endian versus little-endian, there's the issue of how the data is packed for the given structure by that program with that compiler. If you want to save an entire structure, you can't use any pointers; you would have to replace each one with a character buffer large enough for your needs. If the other machine is going to be the same architecture, then with #pragma pack(1) there won't be any gaps between the fields of your structure, and you can ensure that the data will appear as if it were serialized, but without the size prefix for your string. You can skip the #pragma pack(1) if you're certain that the other program that will read the data has the exact same settings for the exact same structure. Outside of that, the data won't match up.
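
As a sketch of the "character buffer instead of a pointer" idea: the struct below (info_pod, a made-up name, with an arbitrary 32-byte name field) contains no pointers, so copying its bytes out and back in works, but only between programs that agree on its exact layout.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

struct info_pod {
    char name[32];   // fixed-size buffer instead of std::string; longer names are truncated
    int age;
};

static_assert(std::is_trivially_copyable<info_pod>::value,
              "raw byte copies are only valid for trivially copyable types");

info_pod make_info(const char* name, int age)
{
    info_pod a;
    std::memset(&a, 0, sizeof a);                     // zero padding and unused name bytes
    std::strncpy(a.name, name, sizeof a.name - 1);    // keeps the final '\0' from memset
    a.age = age;
    return a;
}

// Copy the object's bytes out; out must have room for sizeof(info_pod) bytes.
void dump(const info_pod& a, unsigned char* out)
{
    std::memcpy(out, &a, sizeof a);
}

// Rebuild an object from bytes produced by dump with the same compiler and settings.
info_pod reload(const unsigned char* in)
{
    info_pod a;
    std::memcpy(&a, in, sizeof a);
    return a;
}

void pod_round_trip_demo()
{
    unsigned char raw[sizeof(info_pod)];
    dump(make_info("Steve", 10), raw);
    info_pod b = reload(raw);
    assert(std::strcmp(b.name, "Steve") == 0 && b.age == 10);
}
```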

If you serialize to memory first, you can speed up the serialization process. This can usually be accomplished with a buffer class and one templated function for most types.

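#include <cstring>
struct buffer {
    char* buf;  // write cursor into storage owned by the caller
    template<typename T>
    buffer& operator<<(const T& data)
    {
        std::memcpy(buf, &data, sizeof(T));  // byte copy; safe even if buf is unaligned
        buf += sizeof(T);
        return *this;
    }
};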

Obviously you'll need specialized ones for strings and larger data types. You can use memcpy for large structures and passed in pointers to data. For strings, you'll want to prefix the length as was mentioned earlier.

For serious serialization needs, though, there's a lot more to consider.

Source: https://stackoverflow.com/questions/12001262/what-does-data-serialization-do

Tags

c++

serialization