Question
I am having trouble getting pandas to return multiple columns when using apply.
Example:
import pandas as pd
import numpy as np
np.random.seed(1)
df = pd.DataFrame(index=range(2), columns=['a', 'b'])
df.loc[0] = [np.array((1,2,3))], 1
df.loc[1] = [np.array((4,5,6))], 1
df
a b
0 [[1, 2, 3]] 1
1 [[4, 5, 6]] 1
df2 = np.random.randint(1,9, size=(3,2))
df2
array([[4, 6],
[8, 1],
[1, 2]])
def example(x):
    return np.transpose(df2) @ x[0]
df3 = df['a'].apply(example)
df3
0 [23, 14]
1 [62, 41]
I want df3 to have two columns, with one element per column in each row, rather than a single column holding both elements.
So I want something like
df3Wanted
col1 col2
0 23 14
1 62 41
Does anybody know how to fix this?
Answer 1:
A couple of changes are required to achieve this.
First, update the function so that it wraps its result in a list:
def example(x):
    return [np.transpose(df2) @ x[0]]
Then perform the following operation on df3:
wantedDF3 = pd.concat(df3.apply(pd.DataFrame, columns=['col1','col2']).tolist())
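print(wantedDF3) gives the desired layout (the values differ from the question's, presumably because df2 came from a different random draw, and both rows keep index 0 because pd.concat preserves each one-row frame's own index):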
col1 col2
0 40 12
0 97 33
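A minimal sketch of the same wrap-and-concat idea, kept self-contained so it can be run on its own. It assumes df2 is hard-coded to the values printed in the question, and it uses a plain list comprehension (the names vectors, wrapped and frames are illustrative, not from the answer) instead of df3.apply(pd.DataFrame, ...):

```python
import numpy as np
import pandas as pd

# Assumption: df2 is hard-coded to the values printed in the question.
df2 = np.array([[4, 6], [8, 1], [1, 2]])
vectors = [np.array((1, 2, 3)), np.array((4, 5, 6))]

# Each item mimics what the modified example() returns: a one-element
# list holding a length-2 array, which pd.DataFrame reads as one row.
wrapped = [[df2.T @ v] for v in vectors]

frames = [pd.DataFrame(item, columns=['col1', 'col2']) for item in wrapped]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0.
wantedDF3 = pd.concat(frames, ignore_index=True)
print(wantedDF3)
```

Passing ignore_index=True is optional; without it, every row keeps the index 0 of its one-row frame.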
Edit:
Another way to do the same thing, which avoids the memory error issues:
Keep your example function and df3 as they are (same as in the question).
Then, on top of that, use the code below to generate wantedDF3:
col1df = pd.DataFrame(df3.apply(lambda x: x[0]).values, columns=['col1'])
col2df = pd.DataFrame(df3.apply(lambda x: x[1]).values, columns=['col2'])
wantedDF3 = col1df.join(col2df)
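Since every cell of df3 holds an array of the same length, the two per-column lookups can probably be collapsed into a single constructor call. A minimal sketch, assuming df2 is hard-coded to the values printed in the question and that each row has exactly two elements (the vectors list is an illustrative stand-in for column 'a'):

```python
import numpy as np
import pandas as pd

# Assumption: df2 is hard-coded to the values printed in the question.
df2 = np.array([[4, 6], [8, 1], [1, 2]])
vectors = [np.array((1, 2, 3)), np.array((4, 5, 6))]

# Same layout as the question's df3: one length-2 array per row.
df3 = pd.Series([df2.T @ v for v in vectors])

# tolist() yields a list of arrays; the DataFrame constructor treats
# each array as one row, so the elements land in separate columns.
wantedDF3 = pd.DataFrame(df3.tolist(), index=df3.index, columns=['col1', 'col2'])
print(wantedDF3)
```

Passing index=df3.index keeps the original row labels, which matters if df3 does not use a default 0..n-1 index.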
Answer 2:
This answer responds to the comments on the first answer and addresses the memory error. The following example uses data that raises a memory error on my computer with every method suggested so far (the first answer and the comments on it), but it works with the code below:
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
# Number of rows handled per chunk
chunkSize = int(round(nRows / float(numberOfChunks)))
# Convert one chunk at a time so only a slice is expanded in memory at once
for start, stop in zip(np.arange(0, nRows, chunkSize),
                       np.arange(chunkSize, nRows + chunkSize, chunkSize)):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        # The first chunk initialises the result
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
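The chunked conversion can be tried at a small scale with the sketch below. The sizes are deliberately tiny and purely illustrative, the names (s, parts, chunk_size) are hypothetical, and the converted chunks are collected in a list and concatenated once at the end, which is a variation on the loop above rather than the answer's exact method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_cols, n_chunks = 10, 4, 3  # small illustrative sizes (assumption)

# One array per row, mirroring the layout of df2 in the answer.
s = pd.Series([rng.random(n_cols) for _ in range(n_rows)])

# Ceiling division, so the last chunk picks up any leftover rows.
chunk_size = -(-n_rows // n_chunks)

parts = []
for start in range(0, n_rows, chunk_size):
    chunk = s.iloc[start:start + chunk_size]
    # Expand this slice into columns and downcast before keeping it.
    parts.append(pd.DataFrame(chunk.tolist(), index=chunk.index).astype('float16'))

# A single concat at the end avoids re-copying the growing result on every pass.
df3 = pd.concat(parts)
print(df3.shape, df3.dtypes.unique())
```

Whether this is enough to avoid a memory error at the original 25000 x 10000 size depends on the machine; the downcast to float16 also reduces precision.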
Source: https://stackoverflow.com/questions/58392974/pandas-apply-multiple-columns-per-row-instead-of-list