Open WARC file with Python

Submitted by 拥有回忆 on 2019-12-09 13:15:16

Question


I'm trying to open a WARC file with Python using the warc library documented at the following link: http://warc.readthedocs.org/en/latest/

When opening the file with:

import warc
f = warc.open("00.warc.gz")

Everything is fine and the f object is:

<warc.warc.WARCFile instance at 0x1151d34d0>

However, when I try to read every record in the file using:

for record in f:
    print record['WARC-Target-URI'], record['Content-Length']

The following error appears:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 390, in __iter__
    record = self.read_record()
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 373, in read_record
    header = self.read_header(fileobj)
  File "/Users/xxx/anaconda/lib/python2.7/site-packages/warc/warc.py", line 331, in read_header
    raise IOError("Bad version line: %r" % version_line)
IOError: Bad version line: 'WARC/0.18\n'

Is this because my WARC file's version is not supported by the warc library I'm using, or is it something else?


Answer 1:


The ClueWeb09 dataset is distributed in the WARC 0.18 format, but the files have several issues, and some records are malformed.

The most prevalent problem is an extra newline in the WARC header. A few records have other malformed headers as well.

Moreover, the files do not use the standard \r\n end-of-line markers, which is what causes your error: the version line in your traceback ends in '\n' instead of '\r\n'.
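To confirm that line endings are the issue before switching libraries, you can inspect the first line of the file yourself. The following is only a minimal sketch using the standard gzip module; the helper names version_line_ending and check_warc are hypothetical and are not part of the warc library.

```python
import gzip
import io


def version_line_ending(fileobj):
    """Classify the line ending of the first line of a binary file object.

    Returns "CRLF", "LF", or "none". Hypothetical helper for diagnosis only.
    """
    first = fileobj.readline()
    if first.endswith(b"\r\n"):
        return "CRLF"
    if first.endswith(b"\n"):
        return "LF"
    return "none"


def check_warc(path):
    """Open a gzip-compressed WARC file and classify its version line ending."""
    with gzip.open(path, "rb") as f:
        return version_line_ending(f)
```

Calling check_warc("00.warc.gz") on the file from the question would be expected to return "LF" if its version line is 'WARC/0.18\n', as the traceback suggests.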

The warc-clueweb library can handle these files. It is a Python library written specifically for working with ClueWeb09 WARC files. According to its documentation:

Only minor modifications to the original library were made. The original documentation of the warc library still holds.




Answer 2:


Yes, thanks to @eyelash for the explanation of this problem.

Some records in ClueWeb09 are indeed malformed. However, both the official warc library and the warc-clueweb fork recommended above have some issues.

The fork cannot handle the ClueWeb12 dataset, and it can also miss 1-2 documents in each .warc.gz file it processes.

So I changed a small amount of code to support both the ClueWeb09 and ClueWeb12 datasets. Here is my repo, which has been tested on 100 billion pages; my warc tools are forked and modified from the warc-clueweb library and the official repo.
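If you want a rough check of whether a reader is dropping records, one option is to count version lines in the decompressed stream and compare that number with how many records the library yields. This is only a sketch under stated assumptions: count_version_lines is a hypothetical helper, the version string b"WARC/0.18" is taken from the traceback in the question, and the count can be too high if a record body happens to contain a line identical to the version line.

```python
import gzip
import io


def count_version_lines(fileobj, version=b"WARC/0.18"):
    """Count lines that consist only of the WARC version string.

    Accepts both LF and CRLF line endings. Hypothetical helper; it may
    overcount if a payload contains a line equal to the version string.
    """
    count = 0
    for line in fileobj:
        if line.rstrip(b"\r\n") == version:
            count += 1
    return count


def count_in_gzip(path, version=b"WARC/0.18"):
    """Apply count_version_lines to a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as f:
        return count_version_lines(f, version)
```

Comparing count_in_gzip("00.warc.gz") with the number of records produced by iterating over the opened file gives a quick indication of whether any documents were skipped.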



Source: https://stackoverflow.com/questions/25784825/open-warc-file-with-python
