How to process and save data in chunks using awk?

Submitted by 廉价感情 on 2020-04-21 04:44:57

Question


Let's say I'm opening a large (several GB) file that I cannot read into memory all at once.

If it's a csv file, we would use:

import pandas as pd

for chunk in pd.read_csv('path/filename', chunksize=10**7):
    # save chunk to disk

Or we could read the file line by line in plain Python and build up a dataframe:

import pandas as pd
with open(fn) as file:
    for line in file:
        # save line to disk, e.g. df=pd.concat([df, line_data]), then save the df

How does one "chunk" data with an awk script? Awk will parse/process text into whatever format you desire, but I don't see how to "chunk" with it. I can write a script script1.awk and run it over my data, but that processes the entire file at once.

Related question, with more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?


Answer 1:


awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can specify what constitutes a record using the RS (record separator) variable. Each code block is conditionally executed on the current record before the next one is read:

$ awk '/pattern/{print "MATCHED", $0 > "output"}' file

The above script will read a line at a time from the input file, and if that line matches pattern it will save the line to the file output, prepended with MATCHED, before reading the next line.
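Building on that, here is a small sketch of how awk itself can write data out in fixed-size chunks, the closest analogue to pandas' chunksize (the demo input, the chunk_N output names, and the 1000-line chunk size are all made up for illustration):

```shell
# Demo input: 2500 numbered lines, standing in for a real large file.
seq 2500 > input.txt

# Stream through the file one line at a time; every 1000 lines,
# close the current output file and start the next one.  close()
# keeps the number of simultaneously open files bounded.
awk '{
    if (NR % 1000 == 1 && NR > 1) { close(out); n++ }
    out = sprintf("chunk_%d", n)
    print > out
}' input.txt
```

Because awk streams one record at a time and each finished chunk is closed before the next begins, memory use stays roughly constant no matter how large the input is.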



Source: https://stackoverflow.com/questions/39870135/how-to-process-and-save-data-in-chunks-using-awk
