Handling Single File Extraction From Corrupted GZ (TAR)

Submitted anonymously (unverified) on 2019-12-03 09:02:45

Question:

This is my first post on Stack Overflow. I have a question about extracting a single file from a TAR archive that uses GZ compression. I'm not the best at Python, so I may be doing this incorrectly; any help would be much appreciated.


Scenario:

A corrupted *.tar.gz file comes in. The first file in the archive contains the information needed to obtain the SN (serial number) of the system, which can be used to identify the machine so that we can notify its administrator that the file was corrupted.

The Problem:

Using the regular UNIX tar binary, I am able to extract just the README file from the archive, even though the archive is incomplete and extracting it fully would return an error. In Python, however, I am unable to extract just one file: an exception is always raised, even when I specify only that single file.

Current Workaround:

I'm using "os.popen" to use the UNIX tar binary in order to obtain just the README file.

Desired Solution:

To use the Python tarfile module to extract just that single file.

Example Error:

UNIX (Works):

[root@athena tmp]# tar -xvzf bundle.tar.gz README
README

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
[root@athena tmp]#
[root@athena tmp]# ls
bundle.tar.gz  README

Python:

>>> import tarfile
>>> tar = tarfile.open("bundle.tar.gz")
>>> data = tar.extractfile("README").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/tarfile.py", line 1364, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.4/tarfile.py", line 1048, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.4/tarfile.py", line 1762, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.4/tarfile.py", line 1059, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib64/python2.4/tarfile.py", line 1778, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.4/tarfile.py", line 1588, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib64/python2.4/gzip.py", line 377, in seek
    self.read(1024)
  File "/usr/lib64/python2.4/gzip.py", line 225, in read
    self._read(readsize)
  File "/usr/lib64/python2.4/gzip.py", line 273, in _read
    self._read_eof()
  File "/usr/lib64/python2.4/gzip.py", line 309, in _read_eof
    raise IOError, "CRC check failed"
IOError: CRC check failed
>>> print data
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

Python (Handling Exception):

>>> tar = tarfile.open("bundle.tar.gz")
>>> try:
...     data = tar.extractfile("README").read()
... except:
...     pass
...
>>> print(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'data' is not defined

Answer 1:

Using the manual Unix method, it looks like gzip decompresses the file up to the point where it breaks.

The Python gzip (and hence tarfile) module exits as soon as it notices the archive is corrupt, because the CRC check fails.

Just an idea, but you could pre-process the damaged archives with gzip and re-compress them so the CRC is valid again:

gunzip < damaged.tar.gz | gzip > corrected.tar.gz 

This gives you a corrected.tar.gz containing all the data up to the point where the archive was broken. You should now be able to use the Python tarfile/gzip libraries without getting CRC exceptions.

Keep in mind that this command decompresses and re-compresses the archive, which costs storage I/O and CPU time, so you shouldn't run it on all your archives.

To be efficient, only run it when you actually get the IOError: CRC check failed exception.
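A minimal sketch of that try-first, repair-on-failure flow (the helper names repair_gzip/read_member and the file paths are illustrative, not from the original answer):

#!/usr/bin/python
import subprocess
import tarfile

def repair_gzip(damaged, corrected):
    # Equivalent of: gunzip < damaged | gzip > corrected
    src = open(damaged, 'rb')
    dst = open(corrected, 'wb')
    gunzip = subprocess.Popen(['gunzip'], stdin=src, stdout=subprocess.PIPE)
    gz = subprocess.Popen(['gzip'], stdin=gunzip.stdout, stdout=dst)
    gunzip.stdout.close()   # let gunzip see a broken pipe if gzip exits early
    gz.wait()
    gunzip.wait()           # a nonzero exit is expected for a damaged archive
    src.close()
    dst.close()

def read_member(archive, member):
    tar = tarfile.open(archive)
    try:
        return tar.extractfile(member).read()
    finally:
        tar.close()

try:
    data = read_member('bundle.tar.gz', 'README')
except IOError:
    # Only pay the gunzip/gzip cost when the CRC check actually fails
    repair_gzip('bundle.tar.gz', 'corrected.tar.gz')
    data = read_member('corrected.tar.gz', 'README')

print data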



Answer 2:

You can do something like this: attempt to decompress the gzip stream into a temporary in-memory buffer, then try extracting the magic file from that. In the following example I'm fairly aggressive about trying to read the entire file; depending on the block size of the gzipped data, you can likely get away with reading at most 128-256 KB. My gut tells me gzip works in blocks of at most 64 KB, but I make no promises.

This method does everything in memory, without intermediate files or writes to disk, but it also keeps the entire decompressed payload in memory, so you will want to fine-tune it for your specific use case.

#!/usr/bin/python

import gzip
import tarfile
import StringIO

# Depending on how your tar file is constructed, you might need to specify
# './README' as your magic_file
magic_file = 'README'

f = gzip.open('corrupt', 'rb')
t = StringIO.StringIO()

# Read the gzip stream in small blocks until the CRC error is raised;
# everything decompressed before the failure stays in the buffer.
try:
    while 1:
        block = f.read(1024)
        if not block:
            break
        t.write(block)
except Exception as e:
    print str(e)
    print '%d bytes decompressed' % (t.tell())

t.seek(0)
tarball = tarfile.TarFile.open(name=None, mode='r', fileobj=t)

try:
    # extractfile() returns a file-like object for the member; read it and
    # search the contents for the serial number, or just print it out
    magic_data = tarball.extractfile(magic_file).read()
    print magic_data
except Exception as e:
    print e

