What is a good algorithm for compacting records in a blocked file?

陌路散爱 提交于 2019-12-01 22:11:51

This sounds like a variation of the bin packing problem, but where you already have an inferior allocation that you want to improve. So I suggest looking at variations of the approaches which are successful for the bin packing problem.

First of all, you probably want to parameterize your problem by defining what you consider "full enough" (where a block is full enough that you don't want to touch it), and what is "too empty" (where a block has so much empty space that it has to have more records added to it). Then, you can classify all the blocks as full enough, too empty, or partially full (those that are neither full enough nor too empty). You then redefine the problem as how to eliminate all the too empty blocks by creating as many full enough blocks as possible while minimising the number of partially full blocks.

You'll also want to work out what's more important: getting the records into the fewest blocks possible, or packing them adequately but with the least amount of blocks read and written.

My approach would be to make an initial pass over all the blocks, to classify them all into one of the three classes defined above. For each block, you also want to keep track of the free space available in it, and for the too empty blocks, you'll want to have a list of all the records and their sizes. Then, starting with the largest records in the too empty blocks, move them into the partially full blocks. If you want to minimise reads and writes, move them into any of the blocks you currently have in memory. If you want to minimise wasted space, find the block with the least amount of empty space that will still hold the record, reading the block in if necessary. If no block will hold the record, create a new block. If a block in memory reaches the "full enough" threshold, write it out. Repeat until all the records in the partially filled blocks have been placed.

I've skipped over more than a few details, but this should give you some idea. Note that the bin packing problem is NP-hard, which means that for practical applications, you'll want to decide what's most important to you, so you can pick an approach that will give you an approximately best solution in reasonable time.

If there is no ordering to these records, I'd simply fill the blocks from the front with records extracted from the last block(s). This will minimize movement of data, is fairly simple, and should do a decent job of packing data tightly.

E.g.:

// records should be sorted by size in memory (probably in a balanced BST)
records = read last N blocks on disk;

foreach (block in blocks) // read from disk into memory
{
    if (block.hasBeenReadFrom())
    {
        // we read from this into records already
        // all remaining records are already in memory

        writeAllToNewBlocks(records);

        // this will leave some empty blocks on the disk that can either
        // be eliminated programmatically or left alone and filled during
        // normal operation

        foreach (record in records)
        {
            record.eraseFromOriginalLocation();
        }

        break;
    }

    while(!block.full())
    {
        moveRecords = new Array; // list of records we've moved

        size = block.availableSpace();
        record = records.extractBestFit(size);
        if (record == null)
        {
            break;
        }

        moveRecords.add(record);
        block.add(record);

        if (records.gettingLow())
        {
            records.readMoreFromDisk();
        }
    }

    if(moveRecords.size() > 0)
    {
        block.writeBackToDisk();
        foreach (record in moveRecords)
        {
            record.eraseFromOriginalLocation();
        }
    }
}

Update: I neglected to maintain the no-blocks-only-in-memory rule. I've updated the pseudocode to fix this. Also fixed a glitch in my loop condition.

A modification of an on-line (to defragment in one pass) bounded space (the memory requirements) bin packing algorithm could probably work here.

See "Bin Packing Approximation Algorithms: Combinatorial Analysis" by Coffman et al.

Here's an algorithm you might be able to leverage, albeit your records within fixed size blocks might require a little bit more work.

Heap Defragmentation in Bounded Time

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!