Order string sequences within a cell

Question

I have the following data in a column of a Pandas dataframe:

col_1
,B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01
,B8A-01,B5H-02,B32-02,B57-00
,B83-01,B83-00,B5H-00
,B83-01,B83-00
,B83-00,B83-01
,B83-00,B92-00,B92-01,B0N-02
,B91-16

FYI: each of these strings begins with a comma, so the above example has 7 rows.

The order of the codes within a row does not matter. Rows 3 and 4 (assuming the index starts at 0) are identical for my purpose.

I need to order these different codes in each row so that I can get accurate counts of each of them.

In other words, I need to turn it into this:

col_1
B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-01,B91-10
B32-02,B57-00,B5H-02,B8A-01
B5H-00,B83-00,B83-01
B83-00,B83-01
B83-00,B83-01
B0N-02,B83-00,B92-00,B92-01
B91-16

I'm not sure where to begin, because the rows contain different numbers of values. I tried splitting on the comma, but then had no idea how to sort across columns when the rows have different numbers of values.

Thanks in advance.

Answer 1:

Option 1
If you want to sort these lexicographically, split on comma and then use np.sort:

import numpy as np
import pandas as pd
v = np.sort(df.col_1.str.split(',', expand=True).fillna(''), axis=1)
df = pd.DataFrame(v).agg(','.join, 1).str.strip(',')

df

0    B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-...
1                          B32-02,B57-00,B5H-02,B8A-01
2                                 B5H-00,B83-00,B83-01
3                                        B83-00,B83-01
4                                        B83-00,B83-01
5                          B0N-02,B83-00,B92-00,B92-01
6                                               B91-16
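To see why the fillna('') and str.strip(',') steps matter in Option 1, here is a minimal self-contained sketch on three of the rows from the question. The names `sample`, `parts`, `sorted_parts` and `result` are assumptions chosen for illustration; they do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question; each one starts with a comma.
sample = pd.DataFrame(
    {"col_1": [",B83-01,B83-00,B5H-00", ",B83-01,B83-00", ",B91-16"]}
)

# expand=True yields a rectangular frame; shorter rows are padded with
# missing values, which fillna("") turns into empty strings so that every
# cell can be compared as a string.
parts = sample["col_1"].str.split(",", expand=True).fillna("")

# Sort each row independently. Empty strings sort before any code,
# so the padding ends up at the front of each row.
sorted_parts = np.sort(parts.to_numpy(dtype=str), axis=1)

# Rejoin the pieces and strip the leading commas produced by the padding.
result = pd.Series([",".join(row) for row in sorted_parts]).str.strip(",")
print(result.tolist())
```

Because the empty strings always collect on one side after sorting, a single strip of commas is enough to remove them.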

Option 2
Split on comma and call apply + sorted:

df.col_1.str.split(',').apply(sorted).str.join(',').str.strip(',')

0    B0A-00,B0A-01,B63-00,B64-03,B7A-00,B7B-00,B7B-...
1                          B32-02,B57-00,B5H-02,B8A-01
2                                 B5H-00,B83-00,B83-01
3                                        B83-00,B83-01
4                                        B83-00,B83-01
5                          B0N-02,B83-00,B92-00,B92-01
6                                               B91-16
Name: col_1, dtype: object

Thanks to @Dark for the improvement!
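Since the stated goal in the question is to get accurate counts, one possible follow-up is to count the normalized rows with value_counts. This is only a sketch under that assumption and is not part of the original answer; the helper name `normalize_codes` and the four-row sample frame are hypothetical.

```python
import pandas as pd


def normalize_codes(cell: str) -> str:
    """Sort the comma-separated codes in one cell, dropping empty pieces."""
    return ",".join(sorted(code for code in cell.split(",") if code))


# A few rows copied from the question; each one starts with a comma.
df = pd.DataFrame(
    {
        "col_1": [
            ",B8A-01,B5H-02,B32-02,B57-00",
            ",B83-01,B83-00",
            ",B83-00,B83-01",
            ",B91-16",
        ]
    }
)

# After normalization, rows that differ only in code order become equal,
# so value_counts groups them together.
normalized = df["col_1"].map(normalize_codes)
counts = normalized.value_counts()
print(counts)
```

Filtering out the empty pieces inside the helper avoids the need for a trailing strip, at the cost of a plain Python loop per row.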

Source: https://stackoverflow.com/questions/49164066/order-string-sequences-within-a-cell

Tags

python

string

pandas