Question
Short version
Running df2.groupby("EquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99])
on a dataset is sometimes giving me percentiles that appear to reset partway through my data. Why is this, and how can I avoid it?
Full version of code (but not the data) at the end. For one equipment type, for example, the 0.50 quantile comes out far smaller than the 0.25 quantile:
Loaders 0.10 57.731806
0.25 394.004375
0.50 0.288889
0.75 7.201528
0.90 51.015667
0.95 83.949833
0.99 123.148019
Full version
I'm working through a large dataset (on the order of 2,500,000 rows) of equipment failure data. So imagine a condensed array, with 2 columns x 2,500,000 rows (this is a subset of a 40-something column dataset) that contains one row for every time a bit of equipment failed.
EquipmentType TTF
Pump 10
Conveyor 20
Crusher 15
...
<2,500,000 more entries>
...
Pump 5
Conveyor 20
Pump 40
Loader 33
TTF in this table stands for Time To Failure.
For context, I have placed this unsorted dataset into the dataframe df2.
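To make the structure concrete, here is a tiny stand-in built only from the sample rows shown above (obviously not the real data, which comes from a CSV and has ~2,500,000 rows):
import pandas as pd

# Tiny stand-in using only the sample rows above, not the real data
df2 = pd.DataFrame({
    "EquipmentType": ["Pump", "Conveyor", "Crusher", "Pump", "Conveyor", "Pump", "Loader"],
    "TTF": [10, 20, 15, 5, 20, 40, 33],
})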
I have generated typical descriptive statistics for each bit of equipment, like count, min, mean, max, etc, so I could see what was going on in my data.
Count Min Mean Max
EquipmentType
Pump 204136 0.000556 71.797146 23407.41667
CoffeeMachine 152248 0.001111 66.352893 22939.39306
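For reference, the table above comes from the separate groupby calls in the full code at the end; I believe it is equivalent to a single aggregation along these lines (using the simplified column names from above):
# One-call equivalent of the step-by-step summary in the full code at the end
summary = df2.groupby("EquipmentType")["TTF"].agg(["count", "min", "mean", "max"])
print(summary)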
It has a lot of... not quite outliers, but let's say it has some very big entries (maybe 5% of the failure data are very big numbers), and so I am looking for a bit more detail on what is going on. Enter quantiles.
I query the 10th, 25th, 50th, 75th, 90th, 95th and 99th percentile values in my dataset, and print the result.
print(df2.groupby("EquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99]))
These work well for some equipment types, but not for others. They should be non-decreasing, but for some equipment items they suddenly "reset" and start climbing again from a much smaller value.
Most look like this, with an array including the equipment type, the percentile being referred to, and the time to fail that matches that percentile...
Pumps 0.10 0.005556
0.25 0.238889
0.50 1.775000
0.75 2.595833
0.90 4.611389
0.95 7.008125
0.99 15.465278
But then one or two EquipmentTypes have an abrupt change, where a quantile suddenly comes out smaller than the one before it:
Loaders 0.10 57.731806
0.25 394.004375
0.50 0.288889 <-- What just happened here?
0.75 7.201528
0.90 51.015667
0.95 83.949833
0.99 123.148019
Conveyors 0.10 359.597167
0.25 850.714306
0.50 7328.187222
0.75 0.200000 <-- What just happened here?
0.90 0.375000
0.95 0.441667
0.99 0.500000
I have no idea why this could be happening, and I'd like it to stop.
I've checked that quantile doesn't need the data to be in any particular order.
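In other words, I would expect something like the following sketch (on the simplified frame from above) to print True regardless of row order:
# Sanity check: shuffle the rows and confirm the quantiles are unchanged
shuffled = df2.sample(frac=1, random_state=0)
q_original = df2.groupby("EquipmentType").quantile([.1, .5, .9])
q_shuffled = shuffled.groupby("EquipmentType").quantile([.1, .5, .9])
print(q_original.equals(q_shuffled))  # expect True: quantile doesn't care about row order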
I note that when the dataset is imported, pandas tells me it contains multiple data types. Everything so far seems to cope with that "potential" garbage (I'm not sure what else could possibly be in there; it shouldn't be anything non-numeric, at least).
I would like to only look at the entries that are double-precision values, but maybe there are some stray things in there. I'm not sure how to pull out examples of those without trying to dump this into Excel, which would be a new challenge.
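If it helps, this is roughly how I imagine flushing out any non-numeric entries, using the variable and column names from the full code below (untested sketch):
raw = dashboard_df["CC_TBF"]
print(raw.dtype)                                      # 'object' here would suggest mixed types
as_num = pd.to_numeric(raw, errors="coerce")          # anything non-numeric becomes NaN
suspect = dashboard_df[raw.notna() & as_num.isna()]   # present in the file but not parseable as a number
print(suspect.head(20))                               # a sample of the offending rows, if any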
I'd appreciate anyone's thoughts on this.
Full version of code
from pathlib import Path
import pandas as pd

data_folder = Path("C:/Users/myName/Documents")
file_to_open = data_folder / "myData.csv"

dashboard_df = pd.read_csv(file_to_open, sep=',', encoding='unicode_escape', low_memory=False)

# Keep only the two columns of interest, and drop rows with no failure-time value
df2 = dashboard_df[["CC_CauseLocationEquipmentType", "CC_TBF"]]
df3 = df2[df2.CC_TBF.notnull()]

# Descriptive statistics per equipment type
summaryStats = df3.groupby("CC_CauseLocationEquipmentType").count()
summaryStats["Min"] = df3.groupby("CC_CauseLocationEquipmentType").min()
summaryStats["Mean"] = df3.groupby("CC_CauseLocationEquipmentType").mean()
summaryStats["Max"] = df3.groupby("CC_CauseLocationEquipmentType").max()
summaryStats = summaryStats.rename(columns={"CC_TBF": "Count"}, errors="raise")  # assign back so the rename sticks

print("Unfiltered dataset")
print(summaryStats)
print("Quantiles")
print(df3.groupby("CC_CauseLocationEquipmentType").quantile([.1, .25, .5, .75, 0.9, 0.95, 0.99]))
Source: https://stackoverflow.com/questions/62254156/pandas-quantiles-misbehaving-by-getting-smaller-partway-through-a-range-of-pe