Select the first row by group

ぐ巨炮叔叔 提交于 2019-11-26 00:47:59

问题


From a dataframe like this

test <- data.frame(\'id\'= rep(1:5,2), \'string\'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf(\"select id, min(rownames(test)), string 
              from test 
              group by id, string\")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?


回答1:


You can use duplicated to do this very quickly.

test[!duplicated(test$id),]

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15



回答2:


What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the the first row by key

jdtu <- function() unique(DT)

I think, if you are ordering test outside the benchmark, then you can removing the setkey and data.table conversion from the benchmark as well (as the setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(DT, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']


 library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

and with more data

** Edit with unique method**

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
       test replications elapsed relative user.self sys.self 
2  jdt()            5    0.09     2.25      0.09     0.00    
3 jdtu()            5    0.04     1.00      0.05     0.00      
1   ju()            5    0.22     5.50      0.19     0.03        

The unique method is fastest here.




回答3:


I favor the dplyr approach.

group_by(id) followed by either

  • filter(row_number()==1) or
  • slice(1) or
  • top_n(n = -1)
    • top_n() internally uses the rank function. Negative selects from the bottom of rank.

In some instances arranging the ids after the group_by can be necessary.

library(dplyr)

# using filter(), top_n() or slice()

m1 <-
test %>% 
  group_by(id) %>% 
  filter(row_number()==1)

m2 <-
test %>% 
  group_by(id) %>% 
  slice(1)

m3 <-
test %>% 
  group_by(id) %>% 
  top_n(n = -1)

All three methods return the same result

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct> 
1     1 A     
2     2 B     
3     3 C     
4     4 D     
5     5 E



回答4:


A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
setkey(testd,id)
testd[,.SD[1],by = key(testd)]

or this might be considerably faster:

testd[testd[, .I[1], by = key(testd]$V1]



回答5:


(1) SQLite has a built in rowid pseudo-column so this works:

sqldf("select min(rowid) rowid, id, string 
               from test 
               group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) Also sqldf itself has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string 
              from test 
              group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative which mixes the elements of the above two might be even better:

sqldf("select min(rowid) row_names, id, string 
               from test 
               group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)




回答6:


now, for dplyr, adding a distinct counter.

df %>%
    group_by(aa, bb) %>%
    summarise(first=head(value,1), count=n_distinct(value))

You create groups, them summarise within groups.

If data is numeric, you can use:
first(value) [there is also last(value)] in place of head(value, 1)

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
>   group_by(aa, bb) %>%
>   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2



回答7:


A base R option is the split()-lapply()-do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The comma-space 1, ) at the end of the lapply() call is essential as this is equivalent of calling [1, ] to select first row and all columns.



来源:https://stackoverflow.com/questions/13279582/select-the-first-row-by-group

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!