How to number/label data-table by group-number from group_by?

Posted by 旧城冷巷雨未停 on 2019-11-26 02:02:50

Question


I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).


EDIT: this was resolved when group_indices() was added, back in dplyr 0.4.0.


a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3... e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on. How to do this with one mutate(), without a three-step summarize-and-self-join?

dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.

b) Actually, what I really want is to assign a string/character label ('A','B',...). But numbering groups by integers is good enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.

library(dplyr)

set.seed(1234)

# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", i, i) }

df <- tbl_df(data.frame(u=sample.int(3,10,replace=TRUE), v=sample.int(4,10,replace=TRUE)))

# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is the number of elements within the group, not the overall number of the group

   u v
1  2 3
2  1 3
3  1 2
4  2 3
5  1 2
6  3 3
7  1 3
8  1 2
9  3 1
10 3 4

KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()), then self-join
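For reference, the multi-step kludge could look roughly like the sketch below. It swaps summarize() for distinct() to build the lookup table, and the names labels and labelled are only illustrative. The data frame is typed in by hand from the ten rows printed above, so it does not depend on the random seed.

```r
library(dplyr)

# The ten (u, v) rows shown in the question, entered by hand
df <- data.frame(u = c(2, 1, 1, 2, 1, 3, 1, 1, 3, 3),
                 v = c(3, 3, 2, 3, 2, 3, 3, 2, 1, 4))

# Step 1: one row per distinct (u, v) combination, in order of first appearance
# Step 2: number those rows 1, 2, 3, ...
labels <- df %>%
  distinct(u, v) %>%
  mutate(label = row_number())

# Step 3: join the labels back onto the original rows
labelled <- df %>% left_join(labels, by = c("u", "v"))
```

Here (2,3) gets label 1 and (1,3) gets label 2, as in the example in the question, but it takes a lookup table plus a join rather than a single mutate().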

Answer 1:


Updated answer

get_group_number = function(){
    i = 0
    function(){
        i <<- i+1
        i
    }
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())

You can also consider the following, slightly less readable, version:

group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())

Or, using the iterators package:

library(iterators)

counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))



Answer 2:


dplyr has a group_indices() function that you can use like this:

df %>% 
    mutate(label = group_indices(., u, v)) %>% 
    group_by(label) ...
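On newer dplyr versions (1.0.0 and later, as far as I know), cur_group_id() gives the group number directly inside a grouped mutate(), and base R's LETTERS constant can turn it into a character label, which covers part (b) of the question. A minimal sketch, assuming such a dplyr version is installed; the column names id and label and the hand-entered data are only illustrative:

```r
library(dplyr)

# The ten (u, v) rows shown in the question, entered by hand
df <- data.frame(u = c(2, 1, 1, 2, 1, 3, 1, 1, 3, 3),
                 v = c(3, 3, 2, 3, 2, 3, 3, 2, 1, 4))

labelled <- df %>%
  group_by(u, v) %>%
  mutate(id    = cur_group_id(),   # integer group number
         label = LETTERS[id]) %>%  # 1 -> "A", 2 -> "B", ... (at most 26 groups)
  ungroup()
```

Note that the numbers follow the sorted order of the group keys, not the order of first appearance, so (1,2) becomes group 1 here even though (2,3) appears first in the data.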



Answer 3:


Another approach using data.table would be

require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]

which results in:

    u v label
 1: 2 1     1
 2: 1 3     2
 3: 2 1     1
 4: 3 4     3
 5: 3 1     4
 6: 1 1     5
 7: 3 2     6
 8: 2 3     7
 9: 3 2     6
10: 3 4     3



Answer 4:


Updating my answer with three different ways:

A) A neat non-dplyr solution using interaction(u,v):

> df$label <- factor(interaction(df$u,df$v, drop=T))
 [1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
 Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4

> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
 [1] 1 2 3 4 5 4 6 6 7 7

B) Making Randy's neat fast-and-dirty generator-function answer more compact:

get_next_integer = function(){
  i = 0
  function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer() 

df %>% group_by(u,v) %>% mutate(label = get_integer())

C) Also, here is a one-liner using a generator function that abuses global variable assignment:

i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }

df %>% group_by(u,v) %>% mutate(label = generate_integer())

rm(i)



Answer 5:


I don't have enough reputation for a comment, so I'm posting an answer instead.

The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:

my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
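The same first-appearance numbering can be sketched for the two-column case from the question in base R, by pasting the columns into a single key. The key variable and the "." separator are assumptions made for illustration; the data is the ten rows from the question, entered by hand.

```r
# The ten (u, v) rows shown in the question, entered by hand
df <- data.frame(u = c(2, 1, 1, 2, 1, 3, 1, 1, 3, 3),
                 v = c(3, 3, 2, 3, 2, 3, 3, 2, 1, 4))

# Build one key per row, then number the keys by order of first appearance
key <- paste(df$u, df$v, sep = ".")
df$label <- match(key, unique(key))
```

Because unique() keeps values in the order they first occur, (2,3) is numbered 1 and (1,3) is numbered 2, rather than following the sorted order of the keys.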


Source: https://stackoverflow.com/questions/23026145/how-to-number-label-data-table-by-group-number-from-group-by
