Python - Extracting files from a large (6GB+) zip file

Submitted by 懵懂的女人 on 2019-12-03 08:52:13

The problem is that you have a corrupted zip file. I can add more details about the corruption below, but first the practical stuff:

You can use this code snippet to tell you which member within the archive is corrupted. However, print(z.testzip()) would already tell you the same thing: it returns the name of the first bad member, or None if every member checks out. And zip -T or unzip on the command line should also give you that info with the appropriate verbosity.
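A minimal sketch of the testzip() check, assuming Python 3. The function name first_bad_member is hypothetical and used only for illustration. Note that opening the archive can itself raise zipfile.BadZipFile if the central directory is unreadable.

```python
import zipfile


def first_bad_member(path):
    """Return the name of the first corrupt member, or None if all pass.

    testzip() reads every member and checks its header and CRC, so this
    can take a while on a multi-gigabyte archive.
    """
    with zipfile.ZipFile(path) as z:
        return z.testzip()
```

Calling first_bad_member on the archive path should give you either None or the filename to look at first.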


So, what do you do about it?

Well, obviously, if you can get an uncorrupted copy of the file, do that.

If not, if you want to just skip over the bad file and extract everything else, that's pretty easy—mostly the same code as the snippet linked above:

import sys
import zipfile

with open(sys.argv[1], 'rb') as zf:
    z = zipfile.ZipFile(zf, allowZip64=True)
    for member in z.infolist():
        try:
            z.extract(member)
        except zipfile.error as e:
            # log the error and the member's filename, then move on
            sys.stderr.write('skipping %s: %s\n' % (member.filename, e))

The "Bad magic number for file header" exception message means that zipfile was able to successfully open the zipfile, parse its directory, find the information for a member, seek to that member within the archive, and read the header of that member, all of which means you probably have no zip64-related problems here. However, the header it read did not have the expected "magic" signature of PK\003\004. That means the archive is corrupted.
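To see which members are affected without extracting anything, you can compare the four bytes at each member's recorded header offset against that signature yourself. This is only a sketch, assuming Python 3. The function name members_with_bad_magic is hypothetical, and it performs just the signature comparison, not the rest of zipfile's validation.

```python
import zipfile

LOCAL_HEADER_MAGIC = b"PK\x03\x04"  # the same bytes as PK\003\004


def members_with_bad_magic(path):
    """Return (filename, header_offset) pairs whose local header lacks the signature."""
    bad = []
    with open(path, "rb") as raw, zipfile.ZipFile(path) as z:
        for info in z.infolist():
            # header_offset is where the central directory says this
            # member's local file header starts.
            raw.seek(info.header_offset)
            if raw.read(4) != LOCAL_HEADER_MAGIC:
                bad.append((info.filename, info.header_offset))
    return bad
```

The offsets in the result show where in the file the damage starts, which helps when looking for a pattern.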

The fact that the corruption happens to be at exactly 4294967296 implies very strongly that you had a 64-bit problem somewhere along the chain, because that's exactly 2**32.
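The arithmetic behind that suspicion, as a small illustration: 2**32 is the first offset that no longer fits in an unsigned 32-bit field, so any tool that kept offsets in 32 bits would wrap at exactly that point. Whether that is what happened to this archive is an assumption, and the helper name fits_in_u32 is hypothetical.

```python
MAX_U32 = 0xFFFFFFFF  # largest value an unsigned 32-bit field can hold


def fits_in_u32(offset):
    """True if the offset can be stored in 32 bits without wrapping."""
    return 0 <= offset <= MAX_U32


offset = 4294967296
is_power = (offset == 2 ** 32)   # the reported offset is exactly 2**32
wrapped = offset & MAX_U32       # what a 32-bit field would keep of it
```

Here fits_in_u32(offset) is False and wrapped is 0, so a writer limited to 32-bit offsets would have recorded the wrong position for everything from that point on.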


The command-line zip/unzip tools have some workarounds to deal with common causes of corruption that lead to problems like this. It looks like those workarounds may be working for this archive, given that you get a warning but all of the files are apparently recovered. Python's zipfile library does not have those workarounds, and I doubt you want to write your own zip-handling code…

However, that does open the door for two more possibilities:

First, zip might be able to repair the zipfile for you, using the -F or -FF flag. (Read the manpage, or zip -h, or ask at a site like SuperUser if you need help with that.)
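If you want to drive the repair from Python, a sketch along these lines might work. It assumes Info-ZIP zip 3.0, which writes the repaired archive to a new file given with --out. The function names and paths are hypothetical. -FF may ask questions on the terminal, so try it interactively first.

```python
import subprocess


def build_repair_command(broken, fixed, aggressive=False):
    """Build the zip command line for a repair attempt.

    -F tries to fix using the central directory; -FF rescans the whole
    file and is the option to try when -F fails.
    """
    flag = "-FF" if aggressive else "-F"
    return ["zip", flag, broken, "--out", fixed]


def repair(broken, fixed, aggressive=False):
    """Run the repair and return zip's exit status."""
    return subprocess.run(build_repair_command(broken, fixed, aggressive)).returncode
```

For example, repair('broken.zip', 'fixed.zip') leaves the original untouched and writes whatever zip could recover to fixed.zip.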

And if all else fails, you can run the unzip tool from Python, instead of using zipfile, like this:

subprocess.check_output(['unzip', fname])

That gives you a lot less flexibility and power than the zipfile module, of course—but you're not using any of that flexibility anyway; you're just calling extractall.
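One caveat with check_output: Info-ZIP unzip exits with status 1 when it finishes with warnings, and check_output treats any nonzero status as a failure. Since this archive extracts with a warning, a slightly more tolerant wrapper may be useful. This is a sketch, assuming Python 3 and Info-ZIP unzip on the PATH. The names run_unzip and unzip_status_ok are hypothetical.

```python
import subprocess

UNZIP_OK = 0        # normal exit
UNZIP_WARNING = 1   # warnings, but processing completed


def unzip_status_ok(returncode):
    """Treat 'finished with warnings' as success; anything else is a failure."""
    return returncode in (UNZIP_OK, UNZIP_WARNING)


def run_unzip(fname, dest="."):
    """Extract fname into dest with unzip and return its combined output."""
    result = subprocess.run(
        ["unzip", "-o", fname, "-d", dest],  # -o: overwrite without prompting
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    output = result.stdout.decode(errors="replace")
    if not unzip_status_ok(result.returncode):
        raise RuntimeError("unzip exited with %d:\n%s" % (result.returncode, output))
    return output
```

The returned text still contains unzip's warnings, so you can log them.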
