I have large CSVs where I\'m only interested in a subset of the rows. In particular, I\'d like to read in all the rows which occur before a particular condition is met.
You could read the csv in chunks. Since pd.read_csv will return an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.
import itertools as IT
import pandas as pd
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize, header=None)
chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df = pd.concat(chunks)
mask = df['B'] < 10
df = df.loc[mask]
Or, to avoid having to use df.loc[mask] to remove unwanted rows from the last chunk, perhaps a cleaner solution would be to define a custom generator:
import itertools as IT
import pandas as pd
def valid(chunks):
for chunk in chunks:
mask = chunk['B'] < 10
if mask.all():
yield chunk
else:
yield chunk.loc[mask]
break
chunksize = 10 ** 5
chunks = pd.read_csv(filename, chunksize=chunksize, header=None)
df = pd.concat(valid(chunks))