I have a pandas dataframe as follows:
ticker  account      value     date
aa      assets       100,200   20121231, 20131231
bb      liabilities  50, 150   20141231, 20131231
I would like to split df['value'] and df['date'] so that the dataframe looks like this:
ticker  account      value  date
aa      assets       100    20121231
aa      assets       200    20131231
bb      liabilities  50     20141231
bb      liabilities  150    20131231
Would greatly appreciate any help.
You can first split each column, reshape the pieces into a Series with stack, and remove the whitespace with strip:
s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
Then concat both Series to df1:
df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])
Drop the old value and date columns and join df1 back:
print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
  ticker      account value      date
0     aa       assets   100  20121231
1     aa       assets   200  20131231
2     bb  liabilities    50  20141231
3     bb  liabilities   150  20131231
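For convenience, the steps above can be put together as one self-contained script. This is only a sketch of the same split/stack/join approach; the helper name split_to_rows is made up for illustration, and it assumes that value and date hold the same number of comma-separated items in every row.

```python
import pandas as pd

df = pd.DataFrame(
    [["aa", "assets", "100,200", "20121231, 20131231"],
     ["bb", "liabilities", "50, 150", "20141231, 20131231"]],
    columns=["ticker", "account", "value", "date"],
)


def split_to_rows(frame, col):
    # Hypothetical helper: split on commas, stack the pieces into one long
    # Series, trim spaces, and drop the inner index level so that only the
    # original row label remains (repeated once per piece).
    return (frame[col].str.split(",", expand=True)
            .stack()
            .str.strip()
            .reset_index(level=1, drop=True))


cols = ["value", "date"]
pieces = pd.concat([split_to_rows(df, c) for c in cols], axis=1, keys=cols)

# Joining on the index repeats ticker/account once per piece.
result = df.drop(cols, axis=1).join(pieces).reset_index(drop=True)
print(result)
```

Note that the resulting value and date columns still contain strings; convert them afterwards if numbers or dates are needed.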
I'm noticing this question a lot: how do I split a column that holds a list into multiple rows? I've seen it called exploding.
So I wrote a function that will do it.
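import numpy as np
import pandas as pd

def explode(df, columns):
    # Repeat each row label once per element of the list in that row
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # One row of `a` per exploded column (reindex replaces the removed reindex_axis)
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    # join copes with the repeated labels in p, which concat(axis=1) may reject
    return df.drop(columns, axis=1).join(p).reset_index(drop=True)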
But before we can use it, we need lists (or other iterables) in a column.
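To make the mechanics concrete, here is a small sketch of the two NumPy steps such a function relies on: repeating each row label once per list element, and flattening the lists into one array. The sample values are invented for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data only: one list-valued column with uneven list lengths.
df = pd.DataFrame({"ticker": ["aa", "bb"],
                   "value": [["100", "200", "300"], ["50"]]})

# Number of elements in each list (3 and 1 here).
lengths = df["value"].str.len()

# Repeat each row label once per list element.
idx = np.repeat(df.index.values, lengths)

# Flatten all the lists into a single 1-D array, in row order.
flat = np.concatenate(df["value"].values)

# Re-attach the flattened values to the repeated row labels.
exploded = pd.DataFrame({"value": flat}, index=idx)
print(df.drop("value", axis=1).join(exploded).reset_index(drop=True))
```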
Setup
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
df
Split the value and date columns into lists:
df.value = df.value.str.split(',')
df.date = df.date.str.split(',')
df
Now we could explode on either column or both, one after the other.
Solution
explode(df, ['value','date'])
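Newer pandas releases also ship a built-in DataFrame.explode, which can serve the same purpose. The sketch below assumes pandas 1.3 or later (the first release where explode accepts a list of columns) and that the lists in value and date have the same length within each row.

```python
import pandas as pd

df = pd.DataFrame([["aa", "assets", "100,200", "20121231,20131231"],
                   ["bb", "liabilities", "50,50", "20141231,20131231"]],
                  columns=["ticker", "account", "value", "date"])

# Turn the comma-separated strings into lists, as in the setup above.
df["value"] = df["value"].str.split(",")
df["date"] = df["date"].str.split(",")

# Assumes pandas >= 1.3 for multi-column explode; the lists in each row
# must have matching lengths across the listed columns.
out = df.explode(["value", "date"], ignore_index=True)
print(out)
```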
Timing
I removed strip from @jezrael's timing because I could not effectively add it to mine. Note that strip is a necessary step for this question, since the OP's strings have spaces after the commas. My aim was to provide a generic way to explode a column that already contains iterables, and I think I've accomplished that.
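One possible way to deal with the spaces while building the lists is to split on a pattern that swallows the whitespace around each comma. This is only a sketch; it relies on Series.str.split treating a pattern longer than one character as a regular expression.

```python
import pandas as pd

s = pd.Series(["100,200", "50, 150 "])

# Strip the outer whitespace first, then split on a comma together with
# any spaces around it. The multi-character pattern is handled as a regex.
lists = s.str.strip().str.split(r"\s*,\s*")
print(lists.tolist())  # [['100', '200'], ['50', '150']]
```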
code
def get_df(n=1):
    return pd.DataFrame([['aa', 'assets', '100,200,200', '20121231,20131231,20131231'],
                         ['bb', 'liabilities', '50,50', '20141231,20131231']] * n,
                        columns=['ticker', 'account', 'value', 'date'])
small 2 row sample
medium 200 row sample
large 2,000,000 row sample
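A comparison like this could be reproduced with the standard timeit module. The following is a minimal sketch, not the original benchmark: the function name stack_explode, the sample sizes, and the repetition count are assumptions for illustration, and no measured results are implied.

```python
import timeit

import pandas as pd


def get_df(n=1):
    return pd.DataFrame([["aa", "assets", "100,200,200", "20121231,20131231,20131231"],
                         ["bb", "liabilities", "50,50", "20141231,20131231"]] * n,
                        columns=["ticker", "account", "value", "date"])


def stack_explode(df, cols):
    # Split/stack approach without strip. dropna() removes the padding that
    # expand=True creates for rows with fewer pieces, whether or not the
    # installed pandas version already drops it in stack().
    parts = [df[c].str.split(",", expand=True).stack().dropna()
             .reset_index(level=1, drop=True) for c in cols]
    df1 = pd.concat(parts, axis=1, keys=cols)
    return df.drop(cols, axis=1).join(df1).reset_index(drop=True)


repeats = 10  # assumed repetition count; raise it for steadier numbers
for n in (1, 100):
    df = get_df(n)
    seconds = timeit.timeit(lambda: stack_explode(df, ["value", "date"]),
                            number=repeats)
    print(f"{2 * n} rows: {seconds / repeats:.6f} s per call")
```

Any other candidate function can be timed the same way by swapping it into the lambda.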
I wrote an explode function based on the previous answers. It might be useful for anyone who wants to grab it and use it quickly.
def explode(df, cols, split_on=','):
    """
    Explode dataframe on the given columns, split on the given delimiter
    """
    cols_sep = list(set(df.columns) - set(cols))
    df_cols = df[cols_sep]
    explode_len = df[cols[0]].str.split(split_on).map(len)
    repeat_list = []
    # .values replaces as_matrix(), which was removed in pandas 1.0
    for r, e in zip(df_cols.values, explode_len):
        repeat_list.extend([list(r)]*e)
    df_repeat = pd.DataFrame(repeat_list, columns=cols_sep)
    df_explode = pd.concat([df[col].str.split(split_on, expand=True).stack().str.strip().reset_index(drop=True)
                            for col in cols], axis=1)
    df_explode.columns = cols
    return pd.concat((df_repeat, df_explode), axis=1)
Example, using the data from @piRSquared:
df = pd.DataFrame([['aa', 'assets', '100,200', '20121231,20131231'],
                   ['bb', 'liabilities', '50,50', '20141231,20131231']],
                  columns=['ticker', 'account', 'value', 'date'])
explode(df, ['value', 'date'])
output
+-----------+------+-----+--------+
| account|ticker|value| date|
+-----------+------+-----+--------+
| assets| aa| 100|20121231|
| assets| aa| 200|20131231|
|liabilities| bb| 50|20141231|
|liabilities| bb| 50|20131231|
+-----------+------+-----+--------+
Because I'm too new, I'm not allowed to write a comment, so I'm writing an "answer" instead.
@titipata your answer worked really well, but in my opinion there is a small "mistake" in your code that I'm not able to find myself.
I worked with the example from this question and changed just the values.
df = pd.DataFrame([['title1', 'publisher1', '1.1,1.2', '1'],
                   ['title2', 'publisher2', '2', '2.1,2.2']],
                  columns=['titel', 'publisher', 'print', 'electronic'])
explode(df, ['print', 'electronic'])
    publisher   titel print electronic
0  publisher1  title1   1.1          1
1  publisher1  title1   1.2        2.1
2  publisher2  title2     2        2.2
As you can see, in the 'electronic' column, row '1' should contain the value '1' and not '2.1'.
Because of that, the whole dataset gets shifted. I hope someone can help me find a solution for this.
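The shift appears to come from how that explode function is built: the repeat count is taken from the first listed column only, and each column is stacked and re-indexed on its own, so the pieces are lined up purely by position. When the columns hold different numbers of items in a row, the positions no longer match. Below is one possible workaround, as a sketch only. The function name explode_uneven is made up, and it assumes that a shorter list should be padded by repeating its last value, which is a guess at the desired output.

```python
import pandas as pd


def explode_uneven(df, cols, split_on=","):
    # Hypothetical helper: explode row by row so that the pieces of one
    # input row can never spill over into the next one.
    records = []
    for _, row in df.iterrows():
        parts = {c: [p.strip() for p in str(row[c]).split(split_on)]
                 for c in cols}
        n = max(len(v) for v in parts.values())
        for i in range(n):
            rec = row.to_dict()
            for c in cols:
                vals = parts[c]
                # ASSUMPTION: pad a shorter list by repeating its last value.
                rec[c] = vals[i] if i < len(vals) else vals[-1]
            records.append(rec)
    return pd.DataFrame(records, columns=df.columns)


df = pd.DataFrame([["title1", "publisher1", "1.1,1.2", "1"],
                   ["title2", "publisher2", "2", "2.1,2.2"]],
                  columns=["titel", "publisher", "print", "electronic"])
fixed = explode_uneven(df, ["print", "electronic"])
print(fixed)
```

If missing values are preferred over repetition, the padding line can return None instead of vals[-1]. The row-wise loop favors clarity over speed, so it may be slow on large frames.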
Source: https://stackoverflow.com/questions/38651008/splitting-multiple-columns-into-rows-in-pandas-dataframe





