问题
I have a dataframe (the sample looks like this)
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale XL,S,M
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots S S 330
Variation 2.7 Boots M M 330
Variable 3 Helmet Helmet Sizes E42,E41
Variation 3.8 Helmet E42 E42 89
Variation 3.2 Helmet E41 E41 89
What I want to do is sort the values based on Size so the final data frame should look like this:
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale S,M,XL
Variation 2.6 Boots S S 330
Variation 2.7 Boots M M 330
Variation 2.5 Boots XL XL 330
Variable 3 Boots Helmet Sizes E41,E42
Variation 3.2 Helmet E41 E41 89
Variation 3.8 Helmet E42 E42 89
I am able to successfully get the results using this code
sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
.assign(grp=(df['Type'] == 'Variable').cumsum())
.sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df
The issue is that the given code dosen't work on dataframe
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale XL,S,3XL
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots 3XL 3XL 330
Variation 2.7 Boots S S 330
Variable 3 Helmet Helmet Sizes S19, S9
Variation 3.8 Helmet E42 S19 89
Variation 3.2 Helmet E41 S9 89
it gives the results 'S,3XL,XL' and 'S19,S9' whereas I want the results as
Type SKU Description FullDescription Size Price
Variable 2 Boots Shoes on sale S,XL,3XL
Variation 2.7 Boots S S 330
Variation 2.5 Boots XL XL 330
Variation 2.6 Boots 3XL 3XL 330
Variable 3 Helmet Helmet Sizes S9,S19
Variation 3.2 Helmet E41 S9 89
Variation 3.8 Helmet E42 S19 89
also in case of more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL' and in case of second example, 'S9,S19,M9,M19,L9 and so on'
This is what I have done so far but it's not working and showing the wrong order
sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
.assign(grp=(df['Type'] == 'variable').cumsum())
.sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
回答1:
Step 1: recreate the data
import pandas as pd
#----------------------#
# Recreate the dataset #
#----------------------#
# raw input data_1 = """ Variable|2|Boots|Shoes on sale|XL,S,M|
Variation|2.5|Boots XL||XL|330 Variation|2.6|Boots S||S|330
Variation|2.7|Boots M||M|330 Variable|3|Helmet|Helmet Sizes|E42,E41|
Variation|3.8|Helmet E42||E42|89
Variation|3.2|Helmet E41||E41|89"""
data_2 = """ Variable|2|Boots|Shoes on sale|XL,S,3XL|
Variation|2.5|Boots XL||XL|330
Variation|2.6|Boots 3XL||3XL|330
Variation|2.7|Boots S||S|330
Variable|3|Helmet|Helmet Sizes|S19, S9|
Variation|3.8|Helmet E42||S19|89
Variation|3.2|Helmet E41||S9|89"""
# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1 or data_1 + data_2
# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]
#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0]) df
Temp Result
Type SKU Description FullDescription Size Price
0 Variable 2 Boots Shoes on sale XL,S,3XL
1 Variation 2.5 Boots XL XL 330
2 Variation 2.6 Boots 3XL 3XL 330
3 Variation 2.7 Boots S S 330
4 Variable 3 Helmet Helmet Sizes S19, S9
5 Variation 3.8 Helmet E42 S19 89
6 Variation 3.2 Helmet E41 S9 89
Step 2: Create a priority dict
I'm not really into fashion + I'm also a guy --> (I'm only familiar with S M L XL)
But feel free to re order them or add extra sizes into the list
# Prioritize the sizes
# ps, i don't know the order :)
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict
Temp Result
{'XXS': 0,
'XS': 1,
'S': 2,
'M': 3,
'L': 4,
'XL': 5,
'XXL': 6,
'3XL': 7,
'4XL': 8,
'5XL': 9,
'E41': 10,
'E42': 11,
'S9': 12,
'S19': 13}
Step 3: Create a list of tuples from the string of sizes
# Split the string "SIZE" into a list "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip()) for size in sizes.split(',')]) for sizes in df.Size]
df
Temp Result
Type SKU Description FullDescription Size Price TMP_SIZE
0 Variable 2 Boots Shoes on sale XL,S,3XL [(2, S), (5, XL), (7, 3XL)]
1 Variation 2.5 Boots XL XL 330 [(5, XL)]
2 Variation 2.6 Boots 3XL 3XL 330 [(7, 3XL)]
3 Variation 2.7 Boots S S 330 [(2, S)]
4 Variable 3 Helmet Helmet Sizes S19, S9 [(12, S9), (13, S19)]
5 Variation 3.8 Helmet E42 S19 89 [(13, S19)]
6 Variation 3.2 Helmet E41 S9 89 [(12, S9)]
Step 4: Clean TEMP_SIZE
# Create a new SIZE
# loop over the TMPS_SIZE and create a string from the second value of the tuplelist --> ', '.join( my_list )
df['NEW_SIZE'] = [', '.join([ size[1]for size in sizes ]) for sizes in df["TMP_SIZE"] ]
Temp Result
Type SKU Description ... Size Price TMP_SIZE NEW_SIZE
0 Variable 2 ... XL,S,3XL [(2, S), (5, XL), (7, 3XL)] S, XL, 3XL
1 Variation 2.5 ... XL 330 [(5, XL)] XL
2 Variation 2.6 ... 3XL 330 [(7, 3XL)] 3XL
3 Variation 2.7 ... S 330 [(2, S)] S
4 Variable 3 ... S19, S9 [(12, S9), (13, S19)] S9, S19
5 Variation 3.8 ... S19 89 [(13, S19)] S19
6 Variation 3.2 ... S9 89 [(12, S9)] S9
Step 5: grp
add your grp
#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df
Step 6: sort
in the last step, you can sort everything (I think that you need to sort the TMP_SIZE separately)
# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])
来源:https://stackoverflow.com/questions/65616914/sorting-rows-in-python-pandas