Question
I have the following data in a column of a Pandas dataframe:
col_1
,B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01
,B8A-01,B5H-02,B32-02,B57-00
,B83-01,B83-00,B5H-00
,B83-01,B83-00
,B83-00,B83-01
,B83-00,B92-00,B92-01,B0N-02
,B91-16
FYI: each of these strings begins with a comma, so the above example has 7 rows.
The order of the codes within a row does not matter. Rows 3 and 4 (assuming the index starts at 0) are identical for my purposes.
I need to order these different codes in each row so that I can get accurate counts of each of them.
In other words, I need to turn it into this:
col_1
B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-01,B91-10
B32-02,B57-00,B5H-02,B8A-01
B5H-00,B83-00,B83-01
B83-00,B83-01
B83-00,B83-01
B0N-02,B83-00,B92-00,B92-01
B91-16
I'm not sure where to begin because the strings differ in the number of values. I tried splitting on the comma, but then had no idea how to sort across columns when the rows have different numbers of values.
Thanks in advance.
Answer 1:
Option 1
If you want to sort these lexicographically, split on comma and then use np.sort:
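import numpy as np
import pandas as pd
# Pad short rows with '' so every row can be sorted, then rejoin and strip the leftover commas
v = np.sort(df.col_1.str.split(',', expand=True).fillna(''), axis=1)
df = pd.DataFrame(v).agg(','.join, 1).str.strip(',')
df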
0 B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-...
1 B32-02,B57-00,B5H-02,B8A-01
2 B5H-00,B83-00,B83-01
3 B83-00,B83-01
4 B83-00,B83-01
5 B0N-02,B83-00,B92-00,B92-01
6 B91-16
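For reference, here is a self-contained sketch of Option 1 run on the sample data from the question. The column name col_1_sorted is a hypothetical name of my own, and writing the result to a new column (rather than reassigning df, as above) is an assumption about what is wanted, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; every string starts with a comma.
df = pd.DataFrame({"col_1": [
    ",B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01",
    ",B8A-01,B5H-02,B32-02,B57-00",
    ",B83-01,B83-00,B5H-00",
    ",B83-01,B83-00",
    ",B83-00,B83-01",
    ",B83-00,B92-00,B92-01,B0N-02",
    ",B91-16",
]})

# One column per code; short rows are padded with '' so the array is rectangular.
parts = df["col_1"].str.split(",", expand=True).fillna("")

# Sort each row independently; the '' padding sorts to the front of each row.
sorted_parts = np.sort(parts.to_numpy(dtype=object), axis=1)

# Rejoin and strip the leading commas produced by the '' padding.
# "col_1_sorted" is a hypothetical column name chosen for this sketch.
df["col_1_sorted"] = [",".join(row).strip(",") for row in sorted_parts]

print(df["col_1_sorted"])
```

Keeping the original column alongside the sorted one makes it easy to spot-check a few rows before overwriting anything.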
Option 2
Split on comma and call apply + sorted:
df.col_1.str.split(',').apply(sorted).str.join(',').str.strip(',')
0 B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-...
1 B32-02,B57-00,B5H-02,B8A-01
2 B5H-00,B83-00,B83-01
3 B83-00,B83-01
4 B83-00,B83-01
5 B0N-02,B83-00,B92-00,B92-01
6 B91-16
Name: col_1, dtype: object
Thanks to @Dark for the improvement!
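Since the stated goal was accurate counts, here is a rough sketch of the follow-up step, assuming "counts" means how many rows share the same combination of codes. The helper name normalize is hypothetical, and filtering out the empty piece before sorting is a small variation on Option 2 rather than the code from the answer.

```python
import pandas as pd

# Sample data copied from the question; every string starts with a comma.
s = pd.Series([
    ",B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01",
    ",B8A-01,B5H-02,B32-02,B57-00",
    ",B83-01,B83-00,B5H-00",
    ",B83-01,B83-00",
    ",B83-00,B83-01",
    ",B83-00,B92-00,B92-01,B0N-02",
    ",B91-16",
], name="col_1")


def normalize(cell):
    """Hypothetical helper: sort the codes in one cell and rejoin them."""
    # The leading comma yields an empty first piece; "if code" drops it.
    return ",".join(sorted(code for code in cell.split(",") if code))


normalized = s.apply(normalize)

# Rows that differ only in code order now compare equal, so they are counted together.
counts = normalized.value_counts()
print(counts)
```

With the sample data, the two rows containing B83-00 and B83-01 in different orders collapse into a single entry with a count of 2.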
Source: https://stackoverflow.com/questions/49164066/order-string-sequences-within-a-cell