Question
I have the following dummy dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})
        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             NaN
2  i,j,k,l,m  ii~jj~kk~ll~mm
The real dataset has shape (500000, 90).
I need to unnest these values to rows, and I'm using the new explode method for this, which works fine.
The problem is the NaN values: they cause unequal lengths after the explode, so I need to fill each NaN with the same number of delimiters as the filled value in the other column. In this case that is ~~~, since row 1 has three commas.
Expected output:
        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm
Attempt 1:
df['Col2'].fillna(df['Col1'].str.count(',')*'~')
Attempt 2:
np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
This works, but I feel like there's an easier method for this:
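# regex=True is required in pandas >= 2.0, where str.replace matches literally by default
characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~')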
df['Col2'] = df['Col2'].fillna(characters)
print(df)
        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm
d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]
final = pd.concat([d1,d2], axis=1)
print(final)
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
1 e
1 f
1 g
1 h
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
Question: is there an easier and more general method for this, or is my method fine as is?
Answer 1:
pd.concat
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
k: df[k].str.split(delims[k], expand=True)
for k in df}, axis=1
).stack()
Col1 Col2
0 0 a aa
1 b bb
2 c cc
3 d dd
1 0 e NaN
1 f NaN
2 g NaN
3 h NaN
2 0 i ii
1 j jj
2 k kk
3 l ll
4 m mm
This loops over the columns in df. It may be wiser to loop over the keys of the delims dictionary:
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
k: df[k].str.split(delims[k], expand=True)
for k in delims}, axis=1
).stack()
Same thing, different look:
delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
Answer 2:
One way is to use str.repeat and fillna(), though I'm not sure how efficient this is:
df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))
0 aa~bb~cc~dd
1 ~~~
2 ii~jj~kk~ll~mm
Name: Col2, dtype: object
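The same fill can be wrapped in a small reusable function. The following is only a sketch: `fill_with_delimiters` and its parameter names are hypothetical, not part of pandas, and it builds the filler by mapping each count to a repeated string rather than with str.repeat.

```python
import re

import numpy as np
import pandas as pd


def fill_with_delimiters(df, source, target, source_delim=",", target_delim="~"):
    """Fill NaN in `target` with one `target_delim` per `source_delim` in `source`.

    Hypothetical helper for illustration only; not a pandas API.
    """
    # str.count takes a regular expression, so escape the delimiter first
    counts = df[source].str.count(re.escape(source_delim))
    # Turn each count into a run of delimiters, e.g. 3 -> "~~~"
    filler = counts.map(lambda n: target_delim * int(n), na_action="ignore")
    # fillna aligns on the index, so only the missing rows take the filler
    return df[target].fillna(filler)


df = pd.DataFrame({
    "Col1": ["a,b,c,d", "e,f,g,h", "i,j,k,l,m"],
    "Col2": ["aa~bb~cc~dd", np.nan, "ii~jj~kk~ll~mm"],
})
filled = fill_with_delimiters(df, "Col1", "Col2")
print(filled.tolist())
```

Because the filler is aligned on the index, this version does not depend on df having the default 0..n-1 index.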
Answer 3:
Just split the dataframe into two parts, the rows without NaN and the rows with NaN:
df1=df.dropna()
df2=df.drop(df1.index)
d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()
# DataFrame.append was removed in pandas 2.0, so stack the two pieces with pd.concat
final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
1 e NaN
1 f NaN
1 g NaN
1 h NaN
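If the original row order matters, a possible alternative is to pad the missing entries with lists of NaN and explode both columns in one call. This is a sketch that assumes pandas 1.3 or later, where DataFrame.explode accepts a list of columns, and it assumes every non-missing Col2 value has as many parts as its Col1 value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["a,b,c,d", "e,f,g,h", "i,j,k,l,m"],
    "Col2": ["aa~bb~cc~dd", np.nan, "ii~jj~kk~ll~mm"],
})

col1 = df["Col1"].str.split(",")
col2 = df["Col2"].str.split("~")

# Where Col2 is missing, substitute a list of NaN as long as the Col1 list,
# so both columns hold lists of equal length in every row
col2 = pd.Series(
    [c2 if isinstance(c2, list) else [np.nan] * len(c1)
     for c1, c2 in zip(col1, col2)],
    index=df.index,
)

# Multi-column explode (pandas >= 1.3) requires matching lengths per row
final = pd.DataFrame({"Col1": col1, "Col2": col2}).explode(["Col1", "Col2"])
print(final)
```

Unlike the split-and-append approach, the rows stay in their original order (0, 1, 2), with NaN in Col2 for the rows that were missing.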
Answer 4:
zip_longest can be useful here, provided you don't need the original index. It works regardless of which column has more splits:
from itertools import zip_longest, chain
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
# Col1 Col2
#0 a,b,c,d aa~bb~cc~dd
#1 e,f,g,h NaN
#2 i,j,k,l,m ii~jj~kk~ll~mm
#3 x,y xx~yy~zz
l = [zip_longest(*x, fillvalue='')
for x in zip(df.Col1.str.split(',').fillna(''),
df.Col2.str.split('~').fillna(''))]
pd.DataFrame(chain.from_iterable(l))
0 1
0 a aa
1 b bb
2 c cc
3 d dd
4 e
5 f
6 g
7 h
8 i ii
9 j jj
10 k kk
11 l ll
12 m mm
13 x xx
14 y yy
15 zz
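If the original index is needed after all, the zip_longest idea can be adapted to carry the row label along with each pair. This is a sketch of one way to do it, with an explicit loop in place of the list comprehension; the variable names are illustrative.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": ["a,b,c,d", "e,f,g,h", "i,j,k,l,m", "x,y"],
    "Col2": ["aa~bb~cc~dd", np.nan, "ii~jj~kk~ll~mm", "xx~yy~zz"],
})

rows, index = [], []
for idx, c1, c2 in zip(df.index,
                       df["Col1"].str.split(","),
                       df["Col2"].str.split("~")):
    # A missing value splits to NaN rather than a list; treat it as empty
    c1 = c1 if isinstance(c1, list) else []
    c2 = c2 if isinstance(c2, list) else []
    # Pad the shorter side with '' and remember which row each pair came from
    for pair in zip_longest(c1, c2, fillvalue=""):
        rows.append(pair)
        index.append(idx)

final = pd.DataFrame(rows, columns=["Col1", "Col2"], index=index)
print(final)
```

The result has the same 16 rows as above, but each row keeps the label of the row it was split from.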
Source: https://stackoverflow.com/questions/57774352/fill-in-same-amount-of-characters-where-other-column-is-nan