Reading large lz4 compressed JSON data set in Python 2.7

淺唱寂寞╮ submitted on 2019-12-01 08:48:13

Question


I need to analyze a large data set that is distributed as a lz4 compressed JSON file.

The compressed file is almost 1TB. I'd prefer not to uncompress it to disk due to cost. Each "record" in the dataset is very small, but it is obviously not feasible to read the entire data set into memory.

Any advice on how to iterate through records in this large lz4 compressed JSON file in Python 2.7?


Answer 1:


As of version 0.19.1, the Python lz4 bindings provide full support for buffered IO. So, you should be able to do something like:

import lz4.frame

chunk_size = 128 * 1024 * 1024  # ~128 MB of decompressed data per read
with lz4.frame.open('mybigfile.lz4', 'r') as file:
    while True:
        chunk = file.read(size=chunk_size)
        if not chunk:
            break  # end of file reached
        # Do stuff with this chunk of data.

which will read decompressed data from the file in chunks of up to 128 MB at a time.
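If each record sits on its own line (newline-delimited JSON — an assumption, since the question does not say how records are delimited), the chunked reads above can be wrapped in a generator that yields one parsed record at a time. `iter_json_records` is a hypothetical helper name, not part of the lz4 package:

```python
import io
import json


def iter_json_records(fileobj, chunk_size=128 * 1024 * 1024):
    """Yield one parsed JSON record per newline-delimited line,
    reading at most chunk_size bytes from fileobj at a time."""
    leftover = b''
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if leftover.strip():  # final record may lack a trailing newline
        yield json.loads(leftover)


# Usage with the lz4 file object (sketch):
# with lz4.frame.open('mybigfile.lz4', 'r') as f:
#     for record in iter_json_records(f):
#         process(record)
```

Because the generator only ever holds one chunk plus a partial line in memory, it keeps the working set small regardless of the total file size.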

Aside: I am the maintainer of the python lz4 package - please do file issues on the project page if you have problems with the package, or if something is not clear in the documentation.



Source: https://stackoverflow.com/questions/45966508/reading-large-lz4-compressed-json-data-set-in-python-2-7
