Expanding pandas string column of floats memory-efficiently

Submitted by 陌路散爱 on 2019-12-11 13:51:38

Question


I have a DataFrame such as this:

import pandas as pd

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])

The only differences are that I have a few million rows, and the values column in each row is a string of exactly 200 floats instead of the 4 in my example.

The csv file containing this data is ~5 GB. However, the memory footprint shrinks when I load it into pandas and convert the first two string columns into categories, so I am able to perform most manipulations (filtering, slicing, indexing) with no performance issues.

I need to expand the values column of strings into separate columns of floats, so that there are 200 columns, each containing a float. I made an attempt at this, but I consistently run out of memory. In theory this should be possible row by row in a memory-efficient way, since columns of floats should take less memory than the same numbers held in a string. What's a good algorithm for this?

My existing code for splitting the values column is below.

# regex=False makes '[' and ']' literal rather than regex metacharacters
df['values'] = df['values'].str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# code runs out of memory in the next line!
df_values = pd.DataFrame([x.split(',') for x in df['values'].values.tolist()])

df_values = df_values.apply(pd.to_numeric, errors='coerce').fillna(0.0)

df = df.drop(columns='values').join(df_values)

Expected result for my sample, which the above code generates correctly for a small number of rows:

df = pd.DataFrame([['Col1Val', 'Col2Val', 3.0, 31.1, -341.4, 54.13]],
                  columns=['col1', 'col2', 0, 1, 2, 3])

To labour my reasoning for why I'm hoping (wishing!) for a "memory-decreasing" solution: floats should normally take less space than strings:

from sys import getsizeof

getsizeof('334.34')      #55
getsizeof(334.34)        #24
getsizeof('-452.35614')  #59
getsizeof(-452.35614)    #24
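
The same comparison with numpy (which is what pandas uses to store float columns) makes the gap even clearer; a float64 array spends a flat 8 bytes per value, with no per-value Python object overhead:

import numpy as np

arr = np.array([334.34, -452.35614])
print(arr.itemsize)   # 8  -> bytes per float64 element
print(arr.nbytes)     # 16 -> payload for both values, versus ~55 bytes per string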

Answer 1:


For smaller datasets (see the edit below if this method fails due to memory issues), you can try this:

df['values'].str[1:-1].str.split(",", expand=True).astype(float)

The first str[1:-1] operation removes the brackets.

str.split then splits the remaining values on , and, with expand=True, expands them into a dataframe:

    0       1       2       3
0   3.0     31.1    -341.4  54.13
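
To reach the OP's expected layout, the expanded frame still has to be joined back onto the category columns, for example:

expanded = df['values'].str[1:-1].str.split(',', expand=True).astype(float)
result = df.drop(columns='values').join(expanded)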

You can also split the string with a regex matching [, commas, and ]:

df['values'].str.split(r"[\[,\]]", expand=True)

but this results in two extra, empty columns, which have to be dropped before any float conversion:

    0   1   2       3       4       5
0       3   31.1    -341.4  54.13   
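
Those edge columns hold empty strings left by the leading [ and trailing ], so drop them before converting, for example:

parts = df['values'].str.split(r"[\[,\]]", expand=True)
# columns 0 and 5 are the empty strings around the brackets; keep the middle
values = parts.iloc[:, 1:-1].astype(float)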

Edit: (for large datasets)

One can instead fix things at the data-reading stage.

df = pd.read_csv('test.csv', delimiter=',', quotechar='"')

Here, we change the quote character to " so that the original quote character ' is ignored and each line is simply split on ,. Some preprocessing is then needed to fix the misparsed parts.

Given my test.csv being

 c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'

The result of the read_csv is

    c1          c2          v1      v2      v3      v4
0   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'
1   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'
2   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'

Now we can use some str methods to fix each column. Note: if there is a comma inside c1 or c2, the result will be wrong.
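
A sketch of that cleanup for the toy test.csv above (skipinitialspace swallows the spaces after commas; with the OP's 200 floats, only the first and last value columns need the bracket stripping):

import pandas as pd

df = pd.read_csv('test.csv', delimiter=',', quotechar='"', skipinitialspace=True)
df.columns = df.columns.str.strip()   # the sample header has a leading space

# strip the stray single quotes around the category columns
for col in ['c1', 'c2']:
    df[col] = df[col].str.strip(" '")

# the first value column carries '[ and the last carries ]'
df['v1'] = df['v1'].str.strip(" '[")
df['v4'] = df['v4'].str.strip(" ]'")

value_cols = ['v1', 'v2', 'v3', 'v4']
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors='coerce')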




Answer 2:


Option 1
Parse your string column with ast.literal_eval/pd.eval (it's the easiest first step).

import ast
df['values'] = df['values'].apply(ast.literal_eval)

Next, flatten the last column and concatenate with the remaining n - 1 columns.

i = df.iloc[:, :-1]
j = pd.DataFrame(df.iloc[:, -1].tolist())

pd.concat([i, j], axis=1)

     col1     col2  0     1      2      3
0  Col1Val  Col2Val  3  31.1 -341.4  54.13

Here's a version improved for efficiency: use del to delete the column in place, and cut out the slicing operations (they create copies and are wasteful).

j = pd.DataFrame(df['values'].tolist())
del df['values']

pd.concat([df, j], axis=1)

      col1     col2  0     1      2      3
0  Col1Val  Col2Val  3  31.1 -341.4  54.13
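
If even the tolist() materialisation is too big to hold at once, the same idea can be run in chunks so only one slice of parsed lists exists at a time. This is a sketch, not part of the original answer: n_floats=200 follows the OP's description, and chunk_rows is an arbitrary tuning knob.

import ast

import numpy as np
import pandas as pd

def expand_values(df, col='values', n_floats=200, chunk_rows=100_000):
    # preallocate the target: 8 bytes per float instead of a long Python string
    out = np.empty((len(df), n_floats), dtype=np.float64)
    for start in range(0, len(df), chunk_rows):
        chunk = df[col].iloc[start:start + chunk_rows]
        out[start:start + len(chunk)] = [ast.literal_eval(s) for s in chunk]
    del df[col]
    return df.join(pd.DataFrame(out, index=df.index))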

Option 2
str.extractall (can't guarantee performance).

# -? keeps the sign of negative values like -341.4
df = df.set_index(['col1', 'col2'])['values']\
       .str.extractall(r'(-?\d+(?:\.\d*)?)')\
       .unstack()\
       .astype(float)

df.columns = df.columns.droplevel(0)
df.reset_index()

match     col1     col2    0     1      2      3
0      Col1Val  Col2Val  3.0  31.1 -341.4  54.13



Answer 3:


You can use pop to extract the column, apply a JSON parser to convert each string to a list, and pass the result to the DataFrame constructor:

import json

df1 = df.join(pd.DataFrame(df.pop('values').apply(json.loads).values.tolist()))
print(df1)

      col1     col2  0     1      2      3
0  Col1Val  Col2Val  3  31.1 -341.4  54.13

print(df1.dtypes)
col1     object
col2     object
0         int64
1       float64
2       float64
3       float64
dtype: object
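
Since json.loads parses 3 as a plain integer, column 0 lands as int64; if an all-float result is preferred, one cast fixes it (a sketch, column positions taken from the example):

num_cols = df1.columns[2:]   # everything after col1/col2
df1[num_cols] = df1[num_cols].astype('float64')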


Source: https://stackoverflow.com/questions/48458706/expanding-pandas-string-column-of-floats-memory-efficiently
