How to find top n% of records in a column of a dataframe using R

浪子不回头ぞ submitted on 2021-02-04 09:29:08

Question


I have a dataset showing the exchange rate of the Australian Dollar versus the US dollar once a day over a period of about 20 years. I have the data in a data frame, with the first column being the date, and the second column being the exchange rate. Here's a sample from the data:

>data
             V1     V2
1    12/12/1983 0.9175
2    13/12/1983 0.9010
3    14/12/1983 0.9000
4    15/12/1983 0.8978
5    16/12/1983 0.8928
6    19/12/1983 0.8770
7    20/12/1983 0.8795
8    21/12/1983 0.8905
9    22/12/1983 0.9005
10   23/12/1983 0.9005

How would I go about displaying the top n% of these records? E.g. say I want to see the days and exchange rates for those days where the exchange rate falls in the top 5% of all exchange rates in the dataset?


Answer 1:


For the top 5%:

n <- 5
# keep rows whose rate is above the (1 - n/100) quantile, here the 95th percentile
data[data$V2 > quantile(data$V2, probs = 1 - n/100), ]
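
The quantile approach above can be wrapped in a small reusable function. This is only a sketch: the helper name `top_percent`, the use of `>=` instead of `>`, the NA handling, and the synthetic `rates` data frame are all assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): return the rows whose
# value in `col` is at or above the (1 - n/100) quantile of that column.
top_percent <- function(df, col, n) {
  cutoff <- quantile(df[[col]], probs = 1 - n / 100, na.rm = TRUE)
  keep <- !is.na(df[[col]]) & df[[col]] >= cutoff
  df[keep, , drop = FALSE]
}

# Synthetic stand-in for the exchange-rate data (made-up values):
# 200 consecutive days with 200 distinct rates.
rates <- data.frame(
  V1 = format(as.Date("1983-12-12") + 0:199, "%d/%m/%Y"),
  V2 = seq(0.8005, by = 0.0005, length.out = 200)
)

top5 <- top_percent(rates, "V2", 5)
print(nrow(top5))
```

Because the cutoff is a value rather than a row count, tied rates at the threshold are either all kept or all dropped, so the result is not always exactly n% of the rows.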



Answer 2:


For the top 5% also:

head(data[order(data$V2, decreasing = TRUE), ], 0.05 * nrow(data))



Answer 3:


Another option is sqldf. Sort by the exchange rate (V2) in descending order and keep the first 5% of rows (here the data frame is named df):

library(sqldf)
sqldf('SELECT * FROM df
       ORDER BY V2 DESC
       LIMIT (SELECT CAST(0.05 * COUNT(*) AS INTEGER) FROM df)
      ')

You can change the rate from 0.05 (5%) to any required rate.




Answer 4:


A dplyr solution could look like this:

library(dplyr)
obs <- nrow(data)
data %>% filter(row_number() <= obs * 0.05)

This only works if the data is already sorted by V2 in descending order. If it is not, arrange it by the variable you're interested in first:

data <- data %>% arrange(desc(V2))
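
As a possible alternative to sorting and then filtering by row number, dplyr 1.0.0 and later provide `slice_max()`, which selects the rows with the largest values directly. The sketch below assumes such a dplyr version is installed and uses a synthetic `rates` data frame with made-up values in place of the real data.

```r
library(dplyr)

# Synthetic stand-in for the exchange-rate data (made-up values).
rates <- data.frame(
  V1 = format(as.Date("1983-12-12") + 0:199, "%d/%m/%Y"),
  V2 = seq(0.8005, by = 0.0005, length.out = 200)
)

# Keep the 5% of rows with the largest V2; no prior sorting is needed.
top5 <- rates %>% slice_max(V2, prop = 0.05)
print(top5)
```

By default `slice_max()` keeps ties (`with_ties = TRUE`), so it can return slightly more than 5% of the rows when several days share the cutoff rate.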



Source: https://stackoverflow.com/questions/1563961/how-to-find-top-n-of-records-in-a-column-of-a-dataframe-using-r
