How do I use .loc with groupby so that creating a new column based on grouped data won't be considered a copy?

Question

I have a CSV file with groups of data, and I am using the groupby() method to separate them. Each group is processed with some simple math: min() and max() on a couple of columns, plus a bit of subtraction and multiplication, to create a new column of data. I then graph each group. This mostly works, but I have two complaints about my code: the graphs are individual rather than combined, as I would prefer, and I get a "SettingWithCopyWarning" for each group. From my searching, I believe the solution is either to use .loc or to use a better split-apply (and possibly combine) method. I can do this in Excel, but I am trying to learn Python and, while my code is functioning, I'd like to improve it.

import os.path
import sys
import pandas as pd

filename = "data/cal_data.csv"
df = pd.read_csv(filename, header=0) #one line of headers

# Combine several columns into one; this makes grouping straightforward and
# simplifies graph titles. Not strictly necessary.
df['Test'] = "Model " + df['Model No'] + ", SN " + df['Serial No'].values.astype(str) + ", Test time " + df['Test time'].values.astype(str)

df = df[df.index <= df.groupby('Test')['Test Point'].transform('idxmax')]#drop rows after each max test point

for title, group in df.groupby('Test'):
    x1, x2 = min(group["Test Reading"]),max(group["Test Reading"])
    x4, x3 = max(group["Test Point"]),min(group["Test Point"]) #min is usually zero
    R=(x2-x1)/(x4-x3) #linearize
    
    group['Test Point Error']=100*(group['Test Reading']- (group['Test Point']*R+x1))
    
    ax=group.plot(x='Test Point', y='Test Point Error', title=title, grid=True)
    ax.set_ylabel("% error (+/-"+str(Error_Limit)+"% limit)")

output error:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

edit- added output from df.head(20), and an image of a couple of plots:

    Test Point  Test Reading  Test
0            0       0.10453  Model LC-500, SN 937618, Test time 17:20:10
1           20       0.17271  Model LC-500, SN 937618, Test time 17:20:10
2           50       0.27838  Model LC-500, SN 937618, Test time 17:20:10
3          100       0.45596  Model LC-500, SN 937618, Test time 17:20:10
4          150       0.63435  Model LC-500, SN 937618, Test time 17:20:10
5          200       0.81323  Model LC-500, SN 937618, Test time 17:20:10
6          250       0.99252  Model LC-500, SN 937618, Test time 17:20:10
7          300       1.17222  Model LC-500, SN 937618, Test time 17:20:10
8          350       1.35219  Model LC-500, SN 937618, Test time 17:20:10
9          400       1.53260  Model LC-500, SN 937618, Test time 17:20:10
10         450       1.71312  Model LC-500, SN 937618, Test time 17:20:10
11         500       1.89382  Model LC-500, SN 937618, Test time 17:20:10
14           0       0.10468  Model LC-500, SN 937618, Test time 17:31:46
15          20       0.17284  Model LC-500, SN 937618, Test time 17:31:46
16          50       0.27856  Model LC-500, SN 937618, Test time 17:31:46
17         100       0.45609  Model LC-500, SN 937618, Test time 17:31:46
18         150       0.63457  Model LC-500, SN 937618, Test time 17:31:46
19         200       0.81341  Model LC-500, SN 937618, Test time 17:31:46
20         250       0.99277  Model LC-500, SN 937618, Test time 17:31:46
21         300       1.17237  Model LC-500, SN 937618, Test time 17:31:46

Edit/update 7/23/2020: I made a couple of workarounds that make this work, but I would still appreciate any help. Here is the revised for loop code, writing each group to a new csv file to read later (this way I can add the new column created here), also removing the temporary file if it exists already:

if os.path.exists("data/temp.csv"):
    os.remove("data/temp.csv")
for title, group in df.groupby('Test'):

    x1 = min(group["Test Reading"].head(1))
    x2 = max(group["Test Reading"].tail(1))
    x3 = min(group["Test Point"].head(1))
    x4 = max(group["Test Point"].tail(1))
    R=(x2-x1)/(x4-x3) #linearization scalar
    group['Test Point Error'] =100*(group['Test Reading']- (group['Test Point']*R+x1))/(x2-x1)
    # to_csv opens and closes the file itself, so no explicit open()/close() is needed
    group.to_csv('data/temp.csv', mode="a", index=False, columns=columns, header=False)

Then, reading the temporary CSV, I used seaborn (import seaborn as sns and import matplotlib.pyplot as plt) to plot multiple groups together, grouped by serial number, with 4 subplots per row.

df = pd.read_csv('data/temp.csv', header=0)
df['Model/SN']=df['Model No']+" / "+df['Serial No'].values.astype(str)
g = sns.FacetGrid(df, col='Model/SN', hue='Test', col_wrap=4, sharey=False, sharex=False)

g.map(plt.axhline, y=Error_Limit, ls='--', c='red')
g.map(plt.axhline, y=-Error_Limit, ls='--', c='red')

g = g.map(sns.lineplot, 'Test Point', 'Test Point Error', ci=None)

To sum up: these fixes are not ideal; they are workarounds, and I still get the "SettingWithCopyWarning".

Answer 1:

So you are asking for:

How to stop setting values to copies.
How to create a plot with a subplot for each group in matplotlib.

The "SettingWithCopyWarning" happens because you are creating a column and setting values on each group, which is itself a copy of some rows of the DataFrame. Instead of setting the values inside the loop, I would collect the 'Test Point Error' values in a list of Series, call pd.concat(list) after exiting the for-loop, and then add the result to the DataFrame.

---Edit--- Try replacing:

group['Test Point Error']=100*(group['Test Reading']- (group['Test Point']*R+x1))

with

error_list.append(100 * (group['Test Reading']- (group['Test Point']*R+x1)))

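This appends one Series per group, each keeping the index labels of its rows in df (create error_list = [] before the loop). When the loop is done, the list holds exactly one error value for each row in df. Therefore, after you exit the for-loop:

df = df.assign(test_point_error=pd.concat(error_list))

Because assign aligns on the index, this matches each row exactly, regardless of any sorting on df. Note that assign returns a new DataFrame, so the result has to be assigned back to df.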

---end of edit---
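
Put together, the collect-then-concat idea might look like the following minimal sketch. The small DataFrame is made-up stand-in data (not the asker's CSV), and the error formula is the one from the original loop; nothing is ever written to `group`, so there is no copy to warn about.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data: two tests, three points each.
df = pd.DataFrame({
    "Test": ["A", "A", "A", "B", "B", "B"],
    "Test Point": [0, 50, 100, 0, 50, 100],
    "Test Reading": [0.10, 0.31, 0.50, 0.20, 0.38, 0.60],
})

error_list = []  # will hold one Series of errors per group
for title, group in df.groupby("Test"):
    x1, x2 = group["Test Reading"].min(), group["Test Reading"].max()
    x3, x4 = group["Test Point"].min(), group["Test Point"].max()
    R = (x2 - x1) / (x4 - x3)  # slope of the ideal straight line
    # The result is only collected here; `group` itself is not modified.
    error_list.append(100 * (group["Test Reading"] - (group["Test Point"] * R + x1)))

# pd.concat keeps the original index labels, so assign aligns row by row.
df = df.assign(**{"Test Point Error": pd.concat(error_list)})
print(df)
```

The `**{...}` form is only used so the new column name can contain spaces; with a name like test_point_error a plain keyword argument works.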

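The subplotting issue is similar: you are plotting each group separately while looping. If you plot after exiting the for-loop, then something like

df.groupby('Test').plot(x='Test Point', y='test_point_error', subplots=True)

plots every group in one call. As far as I know, groupby().plot still draws each group on its own axes, so to get all groups into a single figure you may need to create the axes with matplotlib first and pass them in through the ax argument.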
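
One way to get every group into a single figure is to create the axes up front and hand one to each group. This is a sketch under assumptions rather than the only approach: the data is made up, the error column is assumed to have been computed already, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data with the error column already filled in.
df = pd.DataFrame({
    "Test": ["A", "A", "A", "B", "B", "B"],
    "Test Point": [0, 50, 100, 0, 50, 100],
    "Test Point Error": [0.0, 1.0, 0.0, 0.0, -2.0, 0.0],
})

groups = df.groupby("Test")
# One column of the subplot grid per group; squeeze=False always returns a 2D array.
fig, axes = plt.subplots(nrows=1, ncols=groups.ngroups, squeeze=False, sharey=True)

for ax, (title, group) in zip(axes.flat, groups):
    # Passing ax= makes pandas draw on the existing axes instead of a new figure.
    group.plot(x="Test Point", y="Test Point Error", ax=ax,
               title=title, grid=True, legend=False)
    ax.set_ylabel("% error")

fig.tight_layout()
```

For many groups, a grid with several rows (for example 4 columns per row, as in the seaborn workaround) can be requested by changing nrows and ncols.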

On a separate topic, I would do away with the string concatenation for 'Test' and do:

df.groupby(['Model No', 'Serial No', 'Test time'])

This might make your code a lot faster if there are many rows.
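
As a sketch of grouping on the raw columns, the same error calculation can also be written without a loop by using groupby().transform, which returns per-group values aligned to the original rows. This is a swapped-in technique, not something the answer above spells out, and the data is made up; since the new column is assigned to df itself rather than to a group, no copy is involved.

```python
import pandas as pd

# Hypothetical stand-in data using the column names from the question.
df = pd.DataFrame({
    "Model No": ["LC-500"] * 6,
    "Serial No": [937618] * 6,
    "Test time": ["17:20:10"] * 3 + ["17:31:46"] * 3,
    "Test Point": [0, 50, 100, 0, 50, 100],
    "Test Reading": [0.10, 0.31, 0.50, 0.20, 0.38, 0.60],
})

keys = ["Model No", "Serial No", "Test time"]
grouped = df.groupby(keys)

# transform broadcasts each group's min/max back onto that group's rows.
x1 = grouped["Test Reading"].transform("min")
x2 = grouped["Test Reading"].transform("max")
x3 = grouped["Test Point"].transform("min")
x4 = grouped["Test Point"].transform("max")
R = (x2 - x1) / (x4 - x3)  # per-row copy of each group's slope

# Plain column assignment on the original DataFrame.
df["Test Point Error"] = 100 * (df["Test Reading"] - (df["Test Point"] * R + x1))
print(df)
```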

Source: https://stackoverflow.com/questions/62725942/how-do-i-use-loc-with-groupby-so-that-creating-a-new-column-based-on-grouped-da

Tags

python

pandas

dataframe

pandas-groupby