Question
I want to run a function on the rows of a pandas dataframe in a list comprehension. The dataframe can have a varying number of columns. How can I make use of all of these columns?
import pandas as pd
df = {'chrom': ['chr1', 'chr1','chr1'], 'start': [10000, 10100, 12000], 'end':[10150,10120,12250], 'S1':[1, 1, 1],'S2':[2, 2, 2],'S3':[3, 3, 3] }
df = pd.DataFrame(data=df)
print(df)
def func(row):
    print(row)
[func(row) for row in zip(df['chrom'],df['start'],df['S1'],df['S2'],df['S3'])]
How can I do this in a memory-efficient way, so that big dataframes do not cause a memory error?
Answer 1:
The code shown is extremely memory efficient, and should be faster than an iterrows-based solution.
But from your comment, it is not that code that causes the memory error. The problematic code is:
df[list(df.columns.values)].values
or:
df[list(df.columns.values)].to_numpy(copy=False)
because both involve a full copy of the dataframe values unless all columns have the same dtype.
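To make the dtype point concrete, here is a minimal sketch using the dataframe from the question. The names `whole` and `start` are only illustrative. It shows that converting the whole mixed-type frame produces an object array, while a single column keeps its own dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'start': [10000, 10100, 12000],
    'end': [10150, 10120, 12250],
    'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3],
})

# Mixed string and integer columns: the 2-D result has to be one new
# object-dtype array that holds every value of the frame.
whole = df.to_numpy()
print(whole.dtype, whole.shape)

# A single integer column keeps its integer dtype.
start = df['start'].to_numpy()
print(start.dtype)
```

For a large frame, the object array is the part that can exhaust memory, since it is built in addition to the data the dataframe already holds.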
If you want to process an unknown number of columns, a safe way is:
[func(row) for row in zip(*[df[i].values for i in df.columns])]
No copy is required here because df[i].values returns the underlying numpy array of each column. Note the * in front of the list: it passes each column array to zip as a separate argument, so that zip yields one tuple per row.
By the way, if you only need the values of the returned list once, you could save some more memory by using a generator instead of a list:
(func(row) for row in zip(*[df[i].values for i in df.columns]))
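As a rough, self-contained sketch of the column-wise approach, the snippet below feeds the per-column arrays to zip with `*` and consumes the rows through a generator. The body of `func` (returning `end - start`) is a hypothetical stand-in for real per-row work, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'start': [10000, 10100, 12000],
    'end': [10150, 10120, 12250],
    'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3],
})

def func(row):
    # Hypothetical per-row work: interval length (end - start).
    # row is a tuple ordered like df.columns.
    return row[2] - row[1]

# One array per column, whatever the number of columns is.
columns = [df[name].values for name in df.columns]

# zip(*columns) yields one tuple per row, lazily.
results = (func(row) for row in zip(*columns))

# The generator is consumed here; no list of results is built.
total = sum(results)
print(total)
```

Because `results` is a generator, it can only be iterated once. If the per-row results are needed again later, a list is the better choice.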
Answer 2:
Thanks for your answers.
In the meantime, I found the following solution:
df_columns = list(df.columns.values)
[func_using_list_comp(
row,
var1,
var2,
var3,
...,
df_columns) for row in df[df_columns].values]
This way, I did not need the zip function, and it works for any number of columns.
I hope this is also memory efficient. By the way, I accumulate results in var1, var2, and var3 each time I process a row.
If I use a generator instead of a list, how much will it affect my memory usage, and will I still get all the accumulated data after processing all rows?
I ask because I return var1, var2, and var3 after all rows are processed.
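On the generator question, a small sketch may help. It assumes a hypothetical accumulator dict `totals` (playing the role of var1) and a hypothetical function `accumulate`; neither comes from the original code. It illustrates that a generator does nothing until it is consumed, and that the accumulated data is complete once it has been exhausted.

```python
import pandas as pd

df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'start': [10000, 10100, 12000],
    'end': [10150, 10120, 12250],
    'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3],
})

totals = {}  # hypothetical accumulator, stands in for var1

def accumulate(row, totals, columns):
    # Add the sample columns (S1, S2, ...) of this row to running totals.
    for name, value in zip(columns, row):
        if name.startswith('S'):
            totals[name] = totals.get(name, 0) + value

df_columns = list(df.columns)
gen = (accumulate(row, totals, df_columns) for row in df[df_columns].values)

# Creating the generator has not processed any row yet.
before = dict(totals)

# Exhaust the generator so that every row is processed.
for _ in gen:
    pass

print(before)
print(totals)
```

In this sketch the generator only avoids building the list of return values, which is small. The larger allocation is still `df[df_columns].values`, as Answer 1 points out for mixed dtypes.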
Answer 3:
Your list comprehension method seems a bit more confusing than it needs to be, especially considering pandas dataframes have an iterrows() method. You can replace your version with this:
for index, row in df.iterrows():
    func(row)
But I only suggest the above method because your function seems to only print out the row. Depending on what your func really does, you may want to consider using df.apply():
df.apply(func, axis=1)
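For comparison, here is a short sketch of both variants on the question's dataframe. The function `length` is a hypothetical example of a `func` that returns a value instead of printing. Both approaches hand the function a pandas Series for each row, so columns can be accessed by name.

```python
import pandas as pd

df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'start': [10000, 10100, 12000],
    'end': [10150, 10120, 12250],
    'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3],
})

def length(row):
    # Hypothetical per-row function: row is a Series indexed by column name.
    return row['end'] - row['start']

# Explicit loop: iterrows yields (index, Series) pairs.
looped = [length(row) for index, row in df.iterrows()]

# apply with axis=1 calls the function once per row and collects
# the return values into a new Series.
applied = df.apply(length, axis=1)

print(looped)
print(list(applied))
```

Both build a Series object for every row, which is convenient but is the reason Answer 1 expects the zip-based version to be faster.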
Answer 4:
In your example, which prints the full row, zip wraps each row of the numpy array in a one-element tuple; the [0] or * simply unwraps it again:
[func(*row) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]
or
[func(row[0]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]
['chr1' 10000 1 2 3]
['chr1' 10100 1 2 3]
['chr1' 12000 1 2 3]
Printing only the third column:
[func(row[0][2]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]
1
1
1
P.S.: the console also shows [None, None, None] at the end. That is just the value of the list comprehension itself, because func only prints and therefore returns None; it is not part of the printed output.
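A brief sketch of what zip does in these expressions, using illustrative variable names (`arr`, `wrapped`, `direct`) that are not from the answer: called with a single 2-D array, zip yields one-element tuples, and iterating over the array directly gives the rows without any wrapping.

```python
import pandas as pd

df = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1'],
    'start': [10000, 10100, 12000],
    'end': [10150, 10120, 12250],
    'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3],
})

arr = df[['chrom', 'start', 'S1', 'S2', 'S3']].to_numpy()

# zip over one iterable wraps each row in a 1-tuple.
wrapped = next(iter(zip(arr)))
print(type(wrapped).__name__, len(wrapped))

# Iterating over the array directly yields the rows themselves,
# so the third column is simply row[2].
direct = [row[2] for row in arr]
print(direct)
```

This is the same pattern Answer 2 uses when it loops over `df[df_columns].values` without zip.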
Source: https://stackoverflow.com/questions/58567199/memory-efficient-way-for-list-comprehension-of-pandas-dataframe-using-multiple-c