How does one find the start of the “Central Directory” in zip files?

Question

Wikipedia has an excellent description of the ZIP file format, but the "central directory" structure is confusing to me. Specifically this:

This ordering allows a ZIP file to be created in one pass, but it is usually decompressed by first reading the central directory at the end.

The problem is that even the trailing header for the central directory is variable length. How, then, can someone find the start of the central directory in order to parse it?

(Oh, and I did spend some time looking at APPNOTE.TXT in vain before coming here and asking :P)

Answer 1:

My condolences. Reading the Wikipedia description gives me the very strong impression that you need to do a fair amount of guess-and-check work:

Hunt backwards from the end of the file for the 0x06054b50 end-of-central-directory signature, read the 4-byte offset stored 16 bytes into that record to find the first central directory header (signature 0x02014b50), and hope that is it. You could do some sanity checks, such as verifying that the comment length field (20 bytes into the record) matches the number of bytes that actually follow the record, but it sure feels like zip decoders work because people don't put funny characters into their zip comments, filenames, and so forth. Based entirely on the Wikipedia page, anyhow.
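The backward hunt described above can be sketched roughly as follows. This is only an illustration, not code from any particular decoder: it assumes the last bytes of the archive have already been read into a buffer, and the helper names (`find_eocd`, `central_directory_offset`, `read_le16`, `read_le32`) are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Read little-endian integers byte by byte, so the code does not depend on
// host endianness or alignment.
static std::uint16_t read_le16(const std::uint8_t *p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t read_le32(const std::uint8_t *p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical helper: returns the index of the EOCD record inside `tail`
// (the last bytes of the archive), or -1 if no plausible record is found.
static long find_eocd(const std::vector<std::uint8_t> &tail)
{
    const std::size_t kEocdMinSize = 22; // fixed part of the EOCD record
    if (tail.size() < kEocdMinSize) return -1;

    // Walk backwards from the last position where a full record could fit.
    for (std::size_t pos = tail.size() - kEocdMinSize + 1; pos-- > 0;)
    {
        if (read_le32(&tail[pos]) != 0x06054b50u) continue;
        // Sanity check: the comment length field (offset 20) must account
        // for exactly the bytes that follow the fixed part of the record.
        std::size_t commentLen = read_le16(&tail[pos + 20]);
        if (pos + kEocdMinSize + commentLen == tail.size())
            return static_cast<long>(pos);
    }
    return -1;
}

// The offset of the first central directory header is stored 16 bytes
// into the EOCD record.
static std::uint32_t central_directory_offset(const std::vector<std::uint8_t> &tail, long eocdPos)
{
    return read_le32(&tail[static_cast<std::size_t>(eocdPos) + 16]);
}
```

The comment length check is what rejects a stray 0x06054b50 byte pattern inside the comment itself, unless the comment was crafted to look like a complete record.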

Answer 2:

I implemented zip archive support some time ago, and I searched the last few kilobytes for the end of central directory signature (4 bytes). That works pretty well until somebody puts 50 KB of text into the comment, which is unlikely to happen. To be absolutely sure, you can search the last 64 KB plus a few bytes (at most 65,535 bytes of comment plus the 22-byte fixed record), since the comment length is a 16-bit field. After that, I look for the zip64 end of central directory locator, which is easier since it has a fixed structure.
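As a rough sketch of those two steps, assuming the tail of the file is already in memory and the position of the EOCD record within it is known: the names `eocd_search_window`, `find_zip64_locator` and `read_le` are hypothetical, and the locator layout used (20 bytes, signature 0x07064b50, 8-byte offset at byte 8) is the one given in APPNOTE.TXT.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The EOCD record is 22 fixed bytes plus a comment of at most 65535 bytes
// (its length field is 16 bits), so the record must start within this many
// bytes of the end of the file.
const std::size_t kMaxEocdSearch = 22 + 65535;

// How many trailing bytes to read for a file of the given size.
static std::size_t eocd_search_window(std::uint64_t fileSize)
{
    return fileSize < kMaxEocdSearch ? static_cast<std::size_t>(fileSize) : kMaxEocdSearch;
}

// Read a little-endian unsigned integer of `bytes` bytes (up to 8).
static std::uint64_t read_le(const std::uint8_t *p, int bytes)
{
    std::uint64_t v = 0;
    for (int i = bytes - 1; i >= 0; --i) v = (v << 8) | p[i];
    return v;
}

// Hypothetical helper: if a ZIP64 EOCD locator sits immediately before the
// EOCD record at `eocdPos`, store the absolute file offset of the ZIP64 EOCD
// record in `zip64EocdOffset` and return true.
static bool find_zip64_locator(const std::vector<std::uint8_t> &tail,
                               std::size_t eocdPos,
                               std::uint64_t &zip64EocdOffset)
{
    const std::size_t kLocatorSize = 20; // the locator has no variable part
    if (eocdPos < kLocatorSize || eocdPos > tail.size()) return false;
    const std::uint8_t *loc = tail.data() + (eocdPos - kLocatorSize);
    if (read_le(loc, 4) != 0x07064b50u) return false;
    zip64EocdOffset = read_le(loc + 8, 8); // skip signature (4) + disk number (4)
    return true;
}
```

Because the locator has a fixed size and a fixed position relative to the EOCD record, no searching is needed for it once the EOCD record itself has been found.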

Answer 3:

Here is a solution I have just had to roll out, in case anybody needs it. It involves grabbing the central directory.

In my case I did not want any of the compression features that are offered in any of the zip solutions. I just wanted to know about the contents. The following code will return a ZipArchive of a listing of every entry in the zip.

It also uses a minimum amount of file access and memory allocation.

TinyZip.cpp

#include "TinyZip.h"
#include <cstdio>

namespace TinyZip
{
#define VALID_ZIP_SIGNATURE 0x04034b50
#define CENTRAL_DIRECTORY_EOCD 0x06054b50 //signature
#define CENTRAL_DIRECTORY_ENTRY_SIGNATURE 0x02014b50
#define PTR_OFFS(type, mem, offs) *((type*)(mem + offs)) //SHOULD BE OK 

    typedef struct {
        unsigned int signature : 32;
        unsigned int number_of_disk : 16;
        unsigned int disk_where_cd_starts : 16;
        unsigned int number_of_cd_records : 16;
        unsigned int total_number_of_cd_records : 16;
        unsigned int size_of_cd : 32;
        unsigned int offset_of_start : 32;
        unsigned int comment_length : 16;
    } ZipEOCD;

    ZipArchive* ZipArchive::GetArchive(const char *filepath)
    {
        FILE *pFile = nullptr;
#ifdef WIN32
        errno_t err;
        if ((err = fopen_s(&pFile, filepath, "rb")) == 0)
#else
        if ((pFile = fopen(filepath, "rb")) != NULL)
#endif
        {
            int fileSignature = 0;
            //Read the zip header at the start of the file
            if (fread(&fileSignature, sizeof(int), 1, pFile) != 1 || fileSignature != VALID_ZIP_SIGNATURE)
            {
                fclose(pFile);
                return nullptr;
            }

            //Grab the file size
            long fileSize = 0;
            long currPos = 0;

            fseek(pFile, 0L, SEEK_END);
            fileSize = ftell(pFile);
            fseek(pFile, 0L, SEEK_SET);

            //Step back the minimum size of the EOCD record (22 bytes)
            //If it doesn't have any comments, should get an instant signature match
            currPos = (fileSize > 22) ? fileSize - 22 : 0;
            int signature = 0;
            while (currPos > 0)
            {
                fseek(pFile, currPos, SEEK_SET);
                fread(&signature, sizeof(int), 1, pFile);
                if (signature == CENTRAL_DIRECTORY_EOCD)
                {
                    break;
                }
                currPos -= sizeof(char); //step back one byte
            }

            if (currPos != 0)
            {
                ZipEOCD zipOECD;
                fseek(pFile, currPos, SEEK_SET);
                fread(&zipOECD, sizeof(ZipEOCD), 1, pFile);

                long memBlockSize = fileSize - zipOECD.offset_of_start;

                //Allocate zip archive of size
                ZipArchive *pArchive = new ZipArchive(memBlockSize);

                //Read in the whole central directory (also includes the ZipEOCD...)
                fseek(pFile, zipOECD.offset_of_start, SEEK_SET);
                fread((void*)pArchive->m_MemBlock, memBlockSize - 10, 1, pFile);
                long currMemBlockPos = 0;
                long currNullTerminatorPos = -1;
                while (currMemBlockPos < memBlockSize)
                {
                    int sig = PTR_OFFS(int, pArchive->m_MemBlock, currMemBlockPos);
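                    if (sig != CENTRAL_DIRECTORY_ENTRY_SIGNATURE)
                    {
                        fclose(pFile);
                        if (sig == CENTRAL_DIRECTORY_EOCD)
                        {
                            //Terminate the last filename before returning
                            if (currNullTerminatorPos > 0) pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
                            return pArchive;
                        }
                        delete pArchive;
                        return nullptr; //something went wrong
                    }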

                    if (currNullTerminatorPos > 0)
                    {
                        pArchive->m_MemBlock[currNullTerminatorPos] = '\0';
                        currNullTerminatorPos = -1;
                    }

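                    const long offsToFilenameLen = 28;
                    const long offsToFieldLen = 30;
                    const long offsToCommentLen = 32;
                    const long offsetToFilename = 46;

                    //The three length fields are 16 bits each, so read them as unsigned short
                    int filenameLength = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToFilenameLen);
                    int extraFieldLen = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToFieldLen);
                    int commentLen = PTR_OFFS(unsigned short, pArchive->m_MemBlock, currMemBlockPos + offsToCommentLen);
                    const char *pFilepath = &pArchive->m_MemBlock[currMemBlockPos + offsetToFilename];
                    currNullTerminatorPos = (currMemBlockPos + offsetToFilename) + filenameLength;
                    pArchive->m_Entries.push_back(pFilepath);

                    //Skip the fixed header, the filename, the extra field and the file comment
                    currMemBlockPos += (offsetToFilename + filenameLength + extraFieldLen + commentLen);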
                }

                fclose(pFile);
                return pArchive;
            }
            fclose(pFile); //no EOCD record found
        }
        return nullptr;
    }

    ZipArchive::ZipArchive(long size)
    {
        m_MemBlock = new char[size];
    }

    ZipArchive::~ZipArchive()
    {
        delete[] m_MemBlock;
    }

    const std::vector<const char*>  &ZipArchive::GetEntries()
    {
        return m_Entries;
    }
}

TinyZip.h

#ifndef __TinyZip__
#define __TinyZip__

#include <vector>
#include <string>

namespace TinyZip
{
    class ZipArchive
    {
    public:
        ZipArchive(long memBlockSize);
        ~ZipArchive();

        static ZipArchive* GetArchive(const char *filepath);

        const std::vector<const char*>  &GetEntries();

    private:
        std::vector<const char*> m_Entries;
        char *m_MemBlock;
    };

}


#endif

Usage:

 TinyZip::ZipArchive *pArchive = TinyZip::ZipArchive::GetArchive("Scripts_unencrypt.pak");
 if (pArchive != nullptr)
 {
     const std::vector<const char*> &entries = pArchive->GetEntries();
     for (const char *entry : entries)
     {
         //do stuff
     }
     delete pArchive; //entries point into the archive's memory, so free it last
 }

Answer 4:

In case someone out there is still struggling with this problem - have a look at the repository I hosted on GitHub containing my project that could answer your questions.

Zip file reader: basically, it downloads the central directory part of the .zip file, which resides at the end of the file. Then it reads every file and folder name, with its path, from the bytes and prints it to the console.

I have made comments about the more complicated steps in my source code.
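For readers without access to the repository, here is a minimal sketch of the name-reading step. It is not taken from that project: it assumes the raw central directory bytes are already in a buffer, uses the classic 46-byte header layout without ZIP64 handling, and the function name `list_entry_names` is invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Little-endian readers that work regardless of host byte order.
static std::uint32_t le16(const std::uint8_t *p) { return p[0] | (p[1] << 8); }
static std::uint32_t le32(const std::uint8_t *p) { return le16(p) | (le16(p + 2) << 16); }

// Hypothetical helper: `cd` holds the raw central directory bytes.
// Returns the entry names (files and folders, with their paths) in order.
static std::vector<std::string> list_entry_names(const std::vector<std::uint8_t> &cd)
{
    const std::size_t kHeaderSize = 46; // fixed part of each central directory header
    std::vector<std::string> names;
    std::size_t pos = 0;
    // Stop at the first position that does not hold a central directory header.
    while (pos + kHeaderSize <= cd.size() && le32(cd.data() + pos) == 0x02014b50u)
    {
        std::size_t nameLen = le16(cd.data() + pos + 28);
        std::size_t extraLen = le16(cd.data() + pos + 30);
        std::size_t commentLen = le16(cd.data() + pos + 32);
        if (pos + kHeaderSize + nameLen > cd.size()) break; // truncated entry
        names.emplace_back(reinterpret_cast<const char *>(cd.data() + pos + kHeaderSize), nameLen);
        // The next header follows the name, the extra field and the file comment.
        pos += kHeaderSize + nameLen + extraLen + commentLen;
    }
    return names;
}
```

Copying each name into a `std::string` avoids having to write terminators into the buffer, at the cost of one allocation per entry.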

The program only works for .zip files up to about 4 GB. Beyond that, you will have to make some changes to the VM size and maybe more.

Enjoy :)

Source: https://stackoverflow.com/questions/4802097/how-does-one-find-the-start-of-the-central-directory-in-zip-files

Tags

zip

file-format