Question
Short version
Running df2.groupby("EquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99])
on a dataset is sometimes giving me percentiles that appear to reset partway through my data. Why is this, and how can I avoid it?
Full version of code (but not the data) at the end. For one equipment type, for example, the 0.50 quantile comes out far smaller than the 0.25 quantile:
Loaders 0.10 57.731806
0.25 394.004375
0.50 0.288889
0.75 7.201528
0.90 51.015667
0.95 83.949833
0.99 123.148019
Full version
I'm working through a large dataset (on the order of 2,500,000 rows) of equipment failure data. So imagine a condensed array, with 2 columns x 2,500,000 rows (this is a subset of a 40-something column dataset) that contains one row for every time a bit of equipment failed.
EquipmentType TTF
Pump 10
Conveyor 20
Crusher 15
...
<2,500,000 more entries>
...
Pump 5
Conveyor 20
Pump 40
Loader 33
TTF in this table stands for Time To Failure.
For context, I have placed this unsorted dataset into the dataframe df2.
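To make the structure concrete, here is a tiny stand-in built only from the sample rows shown above (obviously not the real data, which comes from a CSV and has ~2,500,000 rows):
import pandas as pd

# Tiny stand-in using only the sample rows above, not the real data
df2 = pd.DataFrame({
    "EquipmentType": ["Pump", "Conveyor", "Crusher", "Pump", "Conveyor", "Pump", "Loader"],
    "TTF": [10, 20, 15, 5, 20, 40, 33],
})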
I have generated typical descriptive statistics for each bit of equipment, like count, min, mean, max, etc, so I could see what was going on in my data.
Count Min Mean Max
EquipmentType
Pump 204136 0.000556 71.797146 23407.41667
CoffeeMachine 152248 0.001111 66.352893 22939.39306
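For reference, the table above comes from the separate groupby calls in the full code at the end; I believe it is equivalent to a single aggregation along these lines (using the simplified column names from above):
# One-call equivalent of the step-by-step summary in the full code at the end
summary = df2.groupby("EquipmentType")["TTF"].agg(["count", "min", "mean", "max"])
print(summary)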
It has a lot of... not quite outliers, but let's say it has some very big entries (maybe 5% of the failure data are very big numbers), and so I am looking for a bit more detail on what is going on. Enter quantiles.
I query the 10th, 25th, 50th, 75th, 90th, 95th and 99th percentile values in my dataset, and print the result.
print(df2.groupby("EquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99]))
These work well for some equipment types, but not for others. They should be non-decreasing, but for some equipment items they suddenly "reset" and start climbing again from a much smaller value.
Most look like this, with an array including the equipment type, the percentile being referred to, and the time to fail that matches that percentile...
Pumps 0.10 0.005556
0.25 0.238889
0.50 1.775000
0.75 2.595833
0.90 4.611389
0.95 7.008125
0.99 15.465278
But then one or two EquipmentTypes have an abrupt change, where a quantile suddenly comes out smaller than the one before it:
Loaders 0.10 57.731806
0.25 394.004375
0.50 0.288889 <-- What just happened here?
0.75 7.201528
0.90 51.015667
0.95 83.949833
0.99 123.148019
Conveyors 0.10 359.597167
0.25 850.714306
0.50 7328.187222
0.75 0.200000 <-- What just happened here?
0.90 0.375000
0.95 0.441667
0.99 0.500000
I have no idea why this could be happening, and I'd like it to stop.
I've checked that quantile doesn't need the data to be in any particular order.
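In other words, I would expect something like the following sketch (on the simplified frame from above) to print True regardless of row order:
# Sanity check: shuffle the rows and confirm the quantiles are unchanged
shuffled = df2.sample(frac=1, random_state=0)
q_original = df2.groupby("EquipmentType").quantile([.1, .5, .9])
q_shuffled = shuffled.groupby("EquipmentType").quantile([.1, .5, .9])
print(q_original.equals(q_shuffled))  # expect True: quantile doesn't care about row order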
I note that when the dataset is imported, pandas tells me it contains multiple data types. Everything so far seems to cope with that "potential" garbage (I'm not sure what else could possibly be in there; it shouldn't be anything non-numeric, at least).
I would like to only look at the entries that are double-precision values, but maybe there are some stray things in there. I'm not sure how to pull out examples of those without trying to dump this into Excel, which would be a new challenge.
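If it helps, this is roughly how I imagine flushing out any non-numeric entries, using the variable and column names from the full code below (untested sketch):
raw = dashboard_df["CC_TBF"]
print(raw.dtype)                                      # 'object' here would suggest mixed types
as_num = pd.to_numeric(raw, errors="coerce")          # anything non-numeric becomes NaN
suspect = dashboard_df[raw.notna() & as_num.isna()]   # present in the file but not parseable as a number
print(suspect.head(20))                               # a sample of the offending rows, if any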
I'd appreciate anyone's thoughts on this.
Full version of code
from pathlib import Path
import pandas as pd

data_folder = Path("C:/Users/myName/Documents")
file_to_open = data_folder / "myData.csv"

dashboard_df = pd.read_csv(file_to_open, sep=',', encoding='unicode_escape', low_memory=False)

# Keep only the two columns of interest, and drop rows with no failure-time value
df2 = dashboard_df[["CC_CauseLocationEquipmentType", "CC_TBF"]]
df3 = df2[df2.CC_TBF.notnull()]

# Descriptive statistics per equipment type
summaryStats = df3.groupby("CC_CauseLocationEquipmentType").count()
summaryStats["Min"] = df3.groupby("CC_CauseLocationEquipmentType").min()
summaryStats["Mean"] = df3.groupby("CC_CauseLocationEquipmentType").mean()
summaryStats["Max"] = df3.groupby("CC_CauseLocationEquipmentType").max()
summaryStats = summaryStats.rename(columns={"CC_TBF": "Count"}, errors="raise")  # assign back so the rename sticks

print("Unfiltered dataset")
print(summaryStats)
print("Quantiles")
print(df3.groupby("CC_CauseLocationEquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99]))
Source: https://stackoverflow.com/questions/62254156/pandas-quantiles-misbehaving-by-getting-smaller-partway-through-a-range-of-pe