Question
Let's say I'm opening a large (several GB) file that I cannot read into memory all at once.
If it's a CSV file, we could use:
for chunk in pd.read_csv('path/filename', chunksize=10**7):
    # save chunk to disk
Or we could do something similar by reading the file line by line:
import pandas as pd
with open(fn) as file:
    for line in file:
        # save line to disk, e.g. df = pd.concat([df, line_data]), then save df
How does one "chunk" data with an awk script? Awk will parse and process text into the format you desire, but I don't know how to "chunk" with awk. One can write a script, say script1.awk, and then process the data, but this processes the entire file at once.
Related question, with a more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?
Answer 1:
awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can define records with the RS (record separator) variable. Each code block is executed conditionally on the current record before the next record is read:
$ awk '/pattern/{print "MATCHED", $0 > "output"}' file
The above script reads a line at a time from the input file; if that line matches pattern, it saves the line to the file output, prepended with MATCHED, before reading the next line.
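To get pandas-style chunking, you can use the built-in NR (current record number) variable to rotate output files every N records. A minimal sketch, assuming a million-record chunk size; the chunk_ file-name prefix is illustrative:
$ awk 'NR % 1000000 == 1 {      # at the start of every millionth record...
           close(out)           # ...close the previous chunk file (a no-op on the first record)
           out = "chunk_" ++i   # ...and switch output to chunk_1, chunk_2, ...
       }
       { print > out }          # write the current record to the open chunk file
      ' file
Each resulting chunk_N file is then small enough to hand to pd.read_csv (or any other per-chunk processing) separately. If a logical record spans multiple lines, set RS accordingly; for example, RS="" treats blank-line-separated paragraphs as single records.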
Source: https://stackoverflow.com/questions/39870135/how-to-process-and-save-data-in-chunks-using-awk