C++ Binary File I/O Operations Slow Down… How DB Handle Binary Files?

Submitted by 冷眼眸甩不掉的悲伤 on 2020-01-16 09:06:42

Question


I am trying to make a simple file-based hash table. Here is my insert member function:

// requires <cstring>, <fstream>, <functional>, <string>; M, BLOCK_SIZE and KEY_SIZE are class constants
private: std::fstream f;  // opened with std::ios::in | std::ios::out | std::ios::binary

public: void insert(const char* this_key, long this_value) {
    // assumes this_key points at a KEY_SIZE-byte, NUL-padded buffer
    char that_key[KEY_SIZE];  // buffer for the stored key (was an uninitialized char*)
    long this_hash = std::hash<std::string>{}(this_key) % M;
    long that_hash;  // also serves as the block status

    long block = this_hash;
    long offset = block * BLOCK_SIZE;
    while (true) {
        this->f.seekg(offset);
        this->f.read((char*) &that_hash, sizeof(long));
        if (that_hash > -1) {  // -1 (by default) indicates a never-allocated block
            this->f.read(that_key, KEY_SIZE);  // read only the key, not the whole block
            if (strcmp(this_key, that_key) == 0) {
                this->f.seekp(this->f.tellg());  // the value immediately follows the key
                this->f.write((char*) &this_value, sizeof(long));
                break;
            } else {
                block = (block + 1) % M;  // linear probing
                offset = block * BLOCK_SIZE;
                continue;
            }
        } else {
            this->f.seekp(offset);
            this->f.write((char*) &this_hash, sizeof(long));  // hash doubles as block status
            this->f.write(this_key, KEY_SIZE);
            this->f.write((char*) &this_value, sizeof(long));
            break;
        }
    }
}

Tests with up to 10M key/value pairs and 50,000,017 blocks went fine (the binary file was about 3.8 GB).

However, a test with 50M keys and 250,000,013 blocks slows down dramatically (the binary file was more than 19 GB in this case). 1,000 inserts usually take 4~5 ms, but occasionally take more than 2,000 ms. It keeps getting slower and slower until a batch takes 40~150 ms (roughly 10x to 30x slower), and I have no idea why.

  • What causes this exceptional slowdown in binary file I/O?
  • Are seekg/seekp and other I/O operations affected by file size? (I could not find any references on this.)
  • How do key/value stores and databases avoid this kind of I/O slowdown?
  • How can I solve this problem?

Answer 1:


Data size

Usually a disk drive's block size is a power of 2, so if your data block size is also a power of 2, you can essentially eliminate the case where a data block crosses a disk block boundary.

In your case, a block size of 64 bytes (or 32 bytes if you don't really need to store the hash) might perform a bit better.
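As a rough illustration, here is a minimal sketch of a 64-byte, power-of-2 block layout; the field sizes (an 8-byte status/hash, a 48-byte key, an 8-byte value) are assumptions for the example, not values taken from the question:

#include <cstdint>

#pragma pack(push, 1)
struct Block {
    int64_t status_hash;  // -1 = never allocated, otherwise the stored hash
    char    key[48];      // fixed-size, zero-padded key
    int64_t value;
};
#pragma pack(pop)

static_assert(sizeof(Block) == 64, "keep the block size a power of 2");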

Insertion order

The other thing you could do to improve performance is to perform your insertions in increasing hash order, to reduce the number of times data must be loaded from disk.

Generally, when data is read from or written to the disk, the OS will read/write a large chunk at a time (maybe 4 KB), so if your algorithm writes data that is local in time, you reduce the number of times data must actually be read from or written to the disk.

Given that you make a lot of insertions, one possibility would be to process them in batches of, say, 1,000 or even 10,000 key/value pairs at a time. Essentially, you accumulate data in memory and sort it, and once you have enough items (or when you are done inserting), you write the data in order.

That way, you should be able to reduce disk accesses, which are very slow. This is probably even more important if you are using a traditional hard drive, since moving the head is slow (in which case it might be useful to defragment it). Also, make sure your hard drive has more than enough free space.
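A minimal sketch of this batching idea, assuming hypothetical names (PendingInsert, flush_batch) and that the per-block probing logic from the question is factored into an insert_one helper:

#include <algorithm>
#include <string>
#include <vector>

struct PendingInsert {
    long        block;  // home block = hash % M, used as the sort key
    std::string key;
    long        value;
};

void flush_batch(std::vector<PendingInsert>& batch) {
    // Writing in ascending block order keeps accesses local in time, so the
    // OS page cache and read-ahead can absorb most of the disk traffic.
    std::sort(batch.begin(), batch.end(),
              [](const PendingInsert& a, const PendingInsert& b) {
                  return a.block < b.block;
              });
    for (const PendingInsert& p : batch) {
        // insert_one(p.key.c_str(), p.value);  // the original probing insert
        (void)p;
    }
    batch.clear();
}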

In some cases, local caching (in your application) might also help, particularly if you are aware of how your data is used.
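A very naive sketch of such an application-level cache, with assumed names (block_cache, read_block_from_disk) and a deliberately crude eviction policy just to keep it short:

#include <array>
#include <cstddef>
#include <unordered_map>

constexpr std::size_t CACHE_BLOCK_SIZE  = 64;       // assumed block size
constexpr std::size_t MAX_CACHED_BLOCKS = 1 << 20;  // cap on cached blocks

std::unordered_map<long, std::array<char, CACHE_BLOCK_SIZE>> block_cache;

const std::array<char, CACHE_BLOCK_SIZE>& get_block(long block) {
    auto it = block_cache.find(block);
    if (it != block_cache.end())
        return it->second;                 // cache hit: no disk access
    if (block_cache.size() >= MAX_CACHED_BLOCKS)
        block_cache.clear();               // crude eviction: drop everything
    std::array<char, CACHE_BLOCK_SIZE> buf{};
    // read_block_from_disk(block, buf.data());  // assumed helper
    return block_cache.emplace(block, buf).first->second;
}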

File size VS collisions

When you use a hash, you want to find the sweet spot between file size and collisions. If you have too many collisions, you will waste a lot of time, and at some point the table degenerates when it becomes hard to find a free slot for almost every insertion.

On the other hand, if your file is very large, you might end up filling your RAM with data that is mostly empty and still need to replace cached data with data from the disk on almost every insertion.

For example, if your data is 20 GB and you can hold, say, 2 GB in memory, then if inserts are truly random, 90% of the time you may need a real disk access.

Configuration

The options will depend on the OS, and this is beyond the scope of a programming forum. If you want to optimize your computer, you should look elsewhere.

Reading

It might be helpful to read about operating systems (file systems, the cache layer…) and algorithms (external sorting, B-trees and other structures) to get a better understanding.

Alternatives

  • Extra RAM
  • Fast SSD
  • Multithreading (e.g. separate input and output threads)
  • Rewriting the algorithm (e.g. to read/write a whole disk page at once; see the sketch after this list)
  • Faster CPU / 64 bit computer
  • Using algorithms designed for such scenarios.
  • Using a database.
  • Profiling code
  • Tuning parameters
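
As a sketch of the "read/write a whole disk page at once" item above, the per-probe seek plus several small reads can be replaced by one read of the whole block that is then parsed in memory; BLOCK_SIZE and the hash/key/value layout are assumptions matching the question's structure, not a confirmed format:

#include <cstring>
#include <fstream>
#include <vector>

void read_whole_block(std::fstream& f, long block, long block_size,
                      long& hash_out, std::vector<char>& key_out, long& value_out) {
    std::vector<char> buf(block_size);
    f.seekg(static_cast<std::streamoff>(block) * block_size);
    f.read(buf.data(), block_size);                    // one I/O call per probe

    std::memcpy(&hash_out, buf.data(), sizeof(long));  // leading status/hash field
    key_out.assign(buf.begin() + sizeof(long),
                   buf.end() - sizeof(long));          // key bytes in the middle
    std::memcpy(&value_out, buf.data() + block_size - sizeof(long), sizeof(long));
}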


来源:https://stackoverflow.com/questions/59478987/c-binary-file-i-o-operations-slow-down-how-db-handle-binary-files
