Running out of RAM when writing to a file line by line [Python]


Question


I have a data processing task over some large data. I run a Python script on an EC2 instance that looks something like the following:

import json

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for line in f:
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')

I loop over the data line by line and write the results to another file line by line.

After it runs for a few hours, I can no longer log in to the server and have to restart the instance to continue:

$ ssh ubuntu@$IP_ADDRESS
ssh_exchange_identification: read: Connection reset by peer

The server is most likely running out of RAM: while the script writes to the file, memory usage slowly creeps up. I am not sure why memory would be a problem at all when reading and writing line by line.
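
To confirm where the growth happens, here is a minimal sketch of the same loop with memory logging added (assumptions: a Linux instance, where resource.getrusage reports ru_maxrss in kilobytes; on macOS it is in bytes, and the 100,000-line interval is an arbitrary choice):

import json
import resource
import sys

def log_peak_rss(n_lines):
    # peak resident set size of this process, in KB on Linux
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sys.stderr.write('lines=%d peak_rss=%d MB\n' % (n_lines, rss_kb // 1024))

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for i, line in enumerate(f, 1):
            results = some_computation(line)
            out.write(json.dumps(results))
            out.write('\n')
            if i % 100000 == 0:  # log every 100k lines
                log_peak_rss(i)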

I have ample hard drive space.

I think the closest existing question to this issue is: Does the Python "open" function save its content in memory or in a temp file?


Answer 1:


I was using spaCy to do some preprocessing of the text. It looks like using the tokenizer causes steady memory growth:

https://github.com/spacy-io/spaCy/issues/285
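
One workaround often suggested for this class of leak is to periodically recreate the spaCy object so its ever-growing internal caches can be garbage-collected. A sketch, not the library's official fix: it assumes the per-line work can tolerate a fresh pipeline, that the 'en' model is the one in use, and the reload interval is arbitrary:

import json
import spacy

RELOAD_EVERY = 100000  # arbitrary interval; tune for your workload

nlp = spacy.load('en')  # assumes the English model is installed

with open(LARGE_FILE, 'r') as f:
    with open(OUTPUT_FILE, 'w') as out:
        for i, line in enumerate(f, 1):
            doc = nlp(line)
            results = [token.text for token in doc]  # stand-in for the real computation
            out.write(json.dumps(results))
            out.write('\n')
            if i % RELOAD_EVERY == 0:
                # replace the pipeline so its string store / caches are
                # released instead of growing without bound
                nlp = spacy.load('en')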



Source: https://stackoverflow.com/questions/37779050/running-out-of-ram-when-writing-to-a-file-line-by-line-python
