percentile

How do I calculate percentiles with python/numpy?

余生长醉 submitted on 2019-12-17 04:41:48
Question: Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array? I am looking for something similar to Excel's percentile function. I looked in NumPy's statistics reference, and couldn't find this. All I could find is the median (50th percentile), but not something more specific.

Answer 1: You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies. percentile() is available in numpy too.
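As a minimal sketch of the NumPy route mentioned in the answer: numpy.percentile takes the data and one or more percentiles on a 0-100 scale. The sample data below is made up for illustration.

```python
import numpy as np

# Made-up sample data, for illustration only.
data = [1, 2, 3, 4, 5]

# A single percentile: the 50th percentile is the median.
median = np.percentile(data, 50)

# Several percentiles at once; results come back in the order requested.
quartiles = np.percentile(data, [25, 50, 75])

print(median)
print(quartiles)
```

Passing a list of percentiles avoids sorting the data once per call.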

vectorize percentile value of column B of column A (for groups)

谁说胖子不能爱 submitted on 2019-12-13 12:59:08
Question: For every pair of src and dest airport cities I want to return a percentile of column a given a value of column b. I can do this manually as such: example df with only 2 pairs of src/dest (I have thousands in my actual df):

               dt  src dest       a       b
    0  2016-01-01  YYZ  SFO  548.12  279.28
    1  2016-01-01  DFW  PDX  111.35  -65.50
    2  2016-02-01  YYZ  SFO   64.84  342.35
    3  2016-02-01  DFW  PDX   63.81   61.64
    4  2016-03-01  YYZ  SFO  614.29  262.83

    {'a': {0: 548.12, 1: 111.34999999999999, 2: 64.840000000000003, 3: 63
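One reading of this question is: for each row, find the percentile rank of its b value among all a values of the same (src, dest) pair. The sketch below assumes that reading and uses only the five rows shown above. The helper percentile_rank is hypothetical; it counts values less than or equal to the score, which is the convention SciPy's percentileofscore calls kind="weak".

```python
from bisect import bisect_right
from collections import defaultdict

# The five rows shown in the question: (src, dest, a, b).
rows = [
    ("YYZ", "SFO", 548.12, 279.28),
    ("DFW", "PDX", 111.35, -65.50),
    ("YYZ", "SFO", 64.84, 342.35),
    ("DFW", "PDX", 63.81, 61.64),
    ("YYZ", "SFO", 614.29, 262.83),
]

# Collect the `a` values per (src, dest) pair and sort each group once.
a_by_pair = defaultdict(list)
for src, dest, a, _ in rows:
    a_by_pair[(src, dest)].append(a)
for values in a_by_pair.values():
    values.sort()


def percentile_rank(sorted_values, score):
    """Percentage of values that are less than or equal to score."""
    return 100.0 * bisect_right(sorted_values, score) / len(sorted_values)


# One rank per input row, looked up against that row's own group.
ranks = [percentile_rank(a_by_pair[(src, dest)], b) for src, dest, _, b in rows]
print(ranks)
```

Sorting each group once and using a binary search per row keeps the cost near O(n log n) overall, which matters with thousands of pairs.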

How to implement percentile in Hive?

被刻印的时光 ゝ submitted on 2019-12-13 11:22:09
Question: Can anyone please tell me how to implement percentile in Hive? I tried the percentile function, but was not able to get the expected result. Example code would greatly help.

Answer 1: Use the percentile function, as per the product documentation: Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral. If you are

Sorting portfolios based on criteria (top30%,Middle 40%. and Bottom 30%)

点点圈 submitted on 2019-12-13 06:52:01
Question: Currently, I have the following table:

    Company  Date  Exchange  Size
    A        2000  A           50
    A        2001  A          100
    B        2000  B          450
    B        2001  B          458

I want to allocate each company into three categories:

    "Top"    ==> Top 30%
    "Middle" ==> Middle 40%
    "Bottom" ==> Bottom 30%

The cutoff values should be calculated after filtering by 'year' and 'Exchange' = A. I have tried the following
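The 30/40/30 split described in this question can be sketched in Python as follows. This is not the asker's attempt, which is cut off in this excerpt. It assumes that the breakpoints for each year are the 30th and 70th percentiles of Size among exchange-A companies only, and that every company is then labelled against those breakpoints. The rows and the percentile helper are hypothetical.

```python
from collections import defaultdict


def percentile(sorted_values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


# Hypothetical rows: (company, year, exchange, size).
rows = [
    ("A", 2000, "A", 50), ("B", 2000, "B", 450), ("C", 2000, "A", 200),
    ("D", 2000, "A", 300), ("E", 2000, "A", 120), ("F", 2000, "B", 20),
]

# Breakpoints per year are computed from exchange "A" companies only.
sizes_by_year = defaultdict(list)
for _, year, exchange, size in rows:
    if exchange == "A":
        sizes_by_year[year].append(size)

cutoffs = {}
for year, sizes in sizes_by_year.items():
    ordered = sorted(sizes)
    cutoffs[year] = (percentile(ordered, 30), percentile(ordered, 70))


def bucket(year, size):
    """Label a company using that year's 30th/70th percentile breakpoints."""
    low, high = cutoffs[year]
    if size > high:
        return "Top"
    if size < low:
        return "Bottom"
    return "Middle"


labels = {company: bucket(year, size) for company, year, _, size in rows}
print(labels)
```

Values that fall exactly on a breakpoint land in "Middle" here; that tie rule is a choice made for the sketch, not something the question specifies.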

New complexity to color coding based on percentile and another factor in ggplot

狂风中的少年 submitted on 2019-12-13 05:15:13
Question: I would like to add another level of complexity to the color coding scheme in the plot below. I want to account for whether each of the values being plotted has passed a statistical test. The dots should only be color coded based on the percentile if they pass the test; otherwise, I would like the dot to be grey. Here is my code as it stands after all the helpful suggestions I received on my first post, "Color code points based on percentile in ggplot" (note: this is some

python plot hist(graph) with percentile

放肆的年华 submitted on 2019-12-13 04:15:15
Question: I have a couple of questions that I tried to solve but couldn't. Please help me. Here is the problem: there is an unknown histogram, and I want to guess its shape (an inexact histogram is okay). I have some information about the histogram.

Given information: min, max, size, mean, percentiles (25%, 50%, 75%)

I want to know how I can get a graph satisfying those conditions. Why doesn't this code work? Thank you.

-----------This is what I tried-------------

    import pandas as pd
    import numpy as np

AttributeError: 'module' object has no attribute 'percentile'

廉价感情. submitted on 2019-12-12 21:14:36
Question: I use this function to calculate a percentile, taken from here:

    import numpy as np
    a = [12, 3, 45, 0, 45, 47, 109, 1, 0, 3]
    np.percentile(a, 25)

But I get this error:

    AttributeError: 'module' object has no attribute 'percentile'

I also tried import numpy.percentile as np, but it didn't work; I got the same error. My numpy version is 1.3.0. I tried to upgrade, but it seems it won't. I used:

    sudo pip install --upgrade scipy

but found that there was no upgrade. my ubuntu version 9.10 my python version
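The error indicates that the installed NumPy does not provide percentile. If upgrading NumPy itself is not possible, a pure-Python fallback is one option. The sketch below is a hypothetical helper, not part of NumPy. It interpolates linearly between the two closest ranks, which is my understanding of np.percentile's default behaviour in current releases.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) of values using linear interpolation."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted data.
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# The list from the question.
a = [12, 3, 45, 0, 45, 47, 109, 1, 0, 3]
result = percentile(a, 25)
print(result)  # 1.5
```

The helper relies only on built-ins, so it does not depend on the NumPy version installed.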

How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

情到浓时终转凉″ submitted on 2019-12-11 08:55:42
Question: I have a very large CSV file which has been imported as a PySpark dataframe: df. The dataframe contains many columns, including the column ireturn. I want to compute the 0.99 and 0.01 percentiles of this column and then add two columns to df, new_col_99 and new_col_01, which contain the 0.99 and 0.01 percentiles, respectively. I wrote the following code, which works for small dataframes, but I get errors when I apply it to my large dataframe.

    from pyspark.sql import

How to output different 25th, 50th, 75th percentiles in single Teradata query?

。_饼干妹妹 submitted on 2019-12-11 08:38:08
Question: A few hours ago I got stuck on something similar and worked out less messy code for outputting the 25th, 50th, and 75th percentiles in a single Teradata query. It can be extended further to produce a "5 point summary". For the minimum and maximum, change the static values according to your population estimate. Someone, somewhere, had asked for an elegant approach, so I am sharing mine. Here's the code:

    SELECT MAX(PER_MIN) AS PER_MIN, MAX(PER_25) AS PER_25, MAX(PER_50) AS PER_50, MAX(PER_75) AS PER_75, MAX

R: Percentile calculations on subsets of data

断了今生、忘了曾经 submitted on 2019-12-11 08:37:15
Question: I have a data set which contains the following identifiers: rscore, gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores for all time spans (~1500) for a given gvkey, and then calculate percentile ranks within a given time span and sic2 based on gvkey. Calculating the percentiles for all time spans is a fairly quick process; however, once I add in calculating the sic2 percentile ranks it's fairly slow, but we are likely