Question
I have a DataFrame such as this:
df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
columns=['col1', 'col2','values'])
The only differences are that I have a few million rows, and the values column is a string of exactly 200 floats in each row instead of 4 as in my example.
The csv file containing this data is ~5 GB. However, the memory footprint shrinks when I load it into pandas and convert the first two string columns into categories. Hence I am able to perform most manipulations (filtering, slicing, indexing) with no performance issues.
I need to expand the values column of strings into separate columns of floats, so there will be 200 columns each containing a float. I made an attempt at this, but I consistently run out of memory. In theory this should be possible line by line in a memory-efficient way, since columns of floats should take less memory than the same numbers in a string. What's a good algorithm for this?
My existing code for splitting the values column is below.
df['values'] = df['values'].str.replace('[', '', regex=False).str.replace(']', '', regex=False)
# code runs out of memory in next line!
df_values = pd.DataFrame([x.split(',') for x in df['values'].values.tolist()])
df_values[df_values.columns] = df_values[df_values.columns].apply(pd.to_numeric, errors='coerce')
df_values[df_values.columns] = df_values[df_values.columns].fillna(0.0)
df = df.drop('values', axis=1).join(df_values)
Expected result for my sample, which above code generates correctly for small number of rows:
df = pd.DataFrame([['Col1Val', 'Col2Val', 3.0, 31.1, -341.4, 54.13]],
columns=['col1', 'col2', 0, 1, 2, 3])
To labour my reasoning for why I'm hoping (wishing!) for a memory-decreasing solution: floats should normally take less space than strings:
from sys import getsizeof
getsizeof('334.34') #55
getsizeof(334.34) #24
getsizeof('-452.35614') #59
getsizeof(-452.35614) #24
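The gap is even wider once the floats live in a NumPy-backed column, since each float64 element takes exactly 8 bytes with no per-object overhead. A quick sketch (the sample value is arbitrary, standing in for one 200-float row):

```python
import numpy as np
from sys import getsizeof

# One row's worth of data: 200 floats as text vs. as a float64 array.
row_str = ', '.join(['-452.35614'] * 200)
row_arr = np.array(row_str.split(', '), dtype=np.float64)

print(getsizeof(row_str))  # a few KB of text
print(row_arr.nbytes)      # 200 * 8 = 1600 bytes of raw float64 data
```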
Answer 1:
For smaller datasets (see the edit below if this method fails due to memory issues), you can try this:
df['values'].str[1:-1].str.split(",", expand=True).astype(float)
The first operation, str[1:-1], removes the brackets. str.split then splits the rest of the values on , and expands the result into a DataFrame (via expand=True):
0 1 2 3
0 3.0 31.1 -341.4 54.13
You can also split the string on the brackets and commas in one pass. Note that astype(float) cannot convert the empty strings produced at the edges, so coerce instead:
df['values'].str.split(r"[\[,\]]", expand=True).apply(pd.to_numeric, errors='coerce')
but this results in two extra, all-NaN columns:
    0    1     2      3      4   5
0 NaN  3.0  31.1 -341.4  54.13 NaN
Edit (for large datasets): one might instead fix the problem at the reading stage.
df = pd.read_csv('test.csv', delimiter=',', quotechar='"')
Here, we change the quote character to " so that the original quote character ' is ignored and everything is split purely on ,. We then need some preprocessing to fix the misparsed parts.
Given my test.csv being:
c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
The result of the read_csv call is:
c1 c2 v1 v2 v3 v4
0 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
1 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
2 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
Now we can use some string methods to fix each column. Note: if there is a comma inside c1 or c2, the results will be wrong.
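The "fix each column" step left unstated above might look like the following sketch. It assumes the quoting pattern from the sample file, so only the edge columns (v1 and v4) carry leftover quotes and brackets:

```python
import io
import pandas as pd

# Reproduce the misparsed frame from the edit above: quotechar='"' makes
# read_csv split the bracketed list across v1..v4.
csv = """c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
"""
df = pd.read_csv(io.StringIO(csv), delimiter=',', quotechar='"')

# Strip the leftover quotes/brackets from the edge columns...
df['c1'] = df['c1'].str.strip(" '")
df['c2'] = df['c2'].str.strip(" '")
df['v1'] = df['v1'].str.strip(" '[")
df['v4'] = df['v4'].str.strip(" ]'")

# ...then convert all value columns to numbers.
value_cols = ['v1', 'v2', 'v3', 'v4']
df[value_cols] = df[value_cols].apply(pd.to_numeric)
```

For the real data the stripping would be applied to the first and last of the 200 value columns instead of v1/v4.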
Answer 2:
Option 1: parse the string column with ast.literal_eval / pd.eval (the easiest first step).
import ast
df['values'] = df['values'].apply(ast.literal_eval)
Next, flatten the last column and concatenate it with the remaining n - 1 columns.
i = df.iloc[:, :-1]
j = pd.DataFrame(df.iloc[:, -1].tolist())
pd.concat([i, j], axis=1)
col1 col2 0 1 2 3
0 Col1Val Col2Val 3 31.1 -341.4 54.13
Here's a more efficient version: use del to drop the column in place, and cut out the slicing operations (they create copies and are wasteful).
j = pd.DataFrame(df['values'].tolist())
del df['values']
pd.concat([df, j], axis=1)
col1 col2 0 1 2 3
0 Col1Val Col2Val 3 31.1 -341.4 54.13
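Since memory, not correctness, is the asker's real bottleneck, this same idea can be applied chunk by chunk at read time, so the full string column never sits in memory at once. A sketch, where the in-memory CSV and chunksize=1 stand in for the real 5 GB file and a larger chunk size:

```python
import io
import pandas as pd

# Stand-in for the real 5 GB file.
csv = io.StringIO(
    "col1,col2,values\n"
    'Col1Val,Col2Val,"[3, 31.1, -341.4, 54.13]"\n'
    'Col1Val,Col2Val,"[3, 31.1, -341.4, 54.13]"\n'
)

parts = []
for chunk in pd.read_csv(csv, chunksize=1):
    # Expand this chunk's string column into float columns.
    vals = pd.DataFrame(
        chunk['values'].str.strip('[]').str.split(',').tolist(),
        index=chunk.index,
    ).astype(float)
    del chunk['values']          # free the strings before joining
    parts.append(chunk.join(vals))

df = pd.concat(parts)
```

Each chunk's strings are freed as soon as they are converted, so peak memory is roughly one chunk of strings plus the accumulated float columns.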
Option 2: str.extractall (can't guarantee performance).
df = df.set_index(['col1', 'col2'])['values']\
       .str.extractall(r'(-?\d+(?:\.\d*)?)')\
       .unstack()
df.columns = df.columns.droplevel(0)
df.reset_index()
match     col1     col2  0     1       2      3
0      Col1Val  Col2Val  3  31.1  -341.4  54.13
Answer 3:
You can use pop to extract the column, apply to convert each string to a list, and the DataFrame constructor:
df1 = df.join(pd.DataFrame(df.pop('values').apply(pd.io.json.loads).values.tolist()))
print (df1)
col1 col2 0 1 2 3
0 Col1Val Col2Val 3 31.1 -341.4 54.13
print (df1.dtypes)
col1 object
col2 object
0 int64
1 float64
2 float64
3 float64
dtype: object
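pd.io.json.loads is no longer available in recent pandas versions; the standard library's json.loads does the same job here, since each bracketed string happens to be valid JSON. A sketch on the sample frame:

```python
import json
import pandas as pd

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])

# pop removes the string column; json.loads turns each string into a list.
df1 = df.join(pd.DataFrame(df.pop('values').apply(json.loads).tolist()))
```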
Source: https://stackoverflow.com/questions/48458706/expanding-pandas-string-column-of-floats-memory-efficiently