Python cannot read “warc.gz” file completely

喜夏-厌秋 提交于 2019-12-04 18:11:22

It seems that the custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.

It would seem that basic stdlib gzip handles the catenated files just fine and so this should work as well:

import gzip
import warc

with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!