Pandas: Memory error when using apply to split single column array into columns

Submitted by 拈花ヽ惹草 on 2019-12-13 19:10:26

Question


Does anybody have a quick fix for the memory error that appears when I run the example below on larger data?

Example:

import pandas as pd
import numpy as np

nRows = 2
nCols = 3

df = pd.DataFrame(index=range(nRows ), columns=range(1))

df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)

df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())

The memory error occurs when creating df3.

The DataFrames in the example:

df
     0
0  NaN
1  NaN

df2
0    [[0.6704675101784022, 0.41730480236712697, 0.5...
1    [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object

df3
          0         1         2
0  0.670468  0.417305  0.558690
0  0.140387  0.198101  0.800745

Answer 1:


First, I think working with lists in pandas is not a good idea; avoid it if you can.

So I believe you can simplify your code a lot:

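import pandas as pd
import numpy as np

nRows = 2
nCols = 3

# Build the numeric block directly instead of storing arrays in cells
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print(df3)
          0         1         2
0  0.903482  0.393081  0.623970
1  0.637877  0.880499  0.299172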
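
When the data already arrives as a column of arrays, the same idea can still be approximated by stacking the arrays into one 2-D block and wrapping it in a DataFrame once, rather than building one DataFrame per row. The following is only a minimal sketch: the Series `s` is a hypothetical stand-in for such a column, and it assumes every array has the same length.

```python
import numpy as np
import pandas as pd

nRows = 2
nCols = 3

# Hypothetical input: a Series whose elements are 1-D arrays of equal length
rng = np.random.default_rng(0)
s = pd.Series([rng.random(nCols) for _ in range(nRows)])

# Stack the arrays into a single (nRows, nCols) block, then wrap it once
stacked = np.vstack(s.tolist())
df3 = pd.DataFrame(stacked, index=s.index)
print(df3.shape)
```

This still needs the full numeric block to fit in memory, so on very large data it may have to be combined with chunking or a smaller dtype.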



Answer 2:


Here is an example that solves the problem. Note that the column in this example holds arrays rather than lists. I cannot avoid this, since my original data comes with lists or arrays in a column. The conversion is done in chunks, and each chunk is cast to float16.

import pandas as pd
import numpy as np
np.random.seed(1)

nRows = 25000
nCols = 10000
numberOfChunks = 5

df = pd.DataFrame(index=range(nRows), columns=range(1))

# Each row holds one array of length nCols
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)

# Number of rows per chunk
chunkSize = int(round(nRows / float(numberOfChunks)))

# Expand the arrays into columns chunk by chunk, casting each chunk to float16
chunks = []
for start in range(0, nRows, chunkSize):
    df2tmp = df2.iloc[start:start + chunkSize]
    chunks.append(pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16'))
df3 = pd.concat(chunks)
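
To see why the cast to float16 matters, it helps to estimate the size of the dense result. The sketch below is only a back-of-the-envelope calculation: `dense_size_gb` is a hypothetical helper, not part of pandas or NumPy, and it ignores the index and any temporary copies made during conversion.

```python
import numpy as np

nRows = 25000
nCols = 10000

def dense_size_gb(n_rows, n_cols, dtype):
    """Size of a dense n_rows x n_cols block of the given dtype, in GB (1e9 bytes)."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9

size_float64 = dense_size_gb(nRows, nCols, 'float64')  # 8 bytes per value
size_float16 = dense_size_gb(nRows, nCols, 'float16')  # 2 bytes per value
print(size_float64, size_float16)  # prints 2.0 0.5
```

The saving comes at a cost: float16 keeps only about three significant decimal digits, so this trade-off is reasonable only if that precision is enough for the data.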


Source: https://stackoverflow.com/questions/58444745/pandas-memory-error-when-using-apply-to-split-single-column-array-into-columns
