R: `split` preserving natural order of factors

执笔经年 2020-12-06 00:30

By default, split orders the resulting groups lexicographically. There may be situations where one would rather preserve the natural order. One can always implement a hand

1 Answer
  • 2020-12-06 01:02

    split converts its second argument, f, to a factor if it isn't one already, and the groups come out in the order of that factor's levels. So, if you want the original order to be retained, convert the column to a factor yourself with the levels in the desired order. That is:

    df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
    # now split
    split(df, df$yearmon)
    # $`4_2013`
    #   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
    # 1        2013-04-01          INDUSINDBK             SIEMENS  4_2013
    # 2        2013-04-01                NMDC               WIPRO  4_2013
    
    # $`9_2012`
    #   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
    # 3        2012-09-28               LUPIN                SAIL  9_2012
    # 4        2012-09-28          ULTRACEMCO                STER  9_2012
    
    # $`4_2012`
    #   Date.of.Inclusion Securities.Included Securities.Excluded yearmon
    # 5        2012-04-27          ASIANPAINT                RCOM  4_2012
    # 6        2012-04-27          BANKBARODA              RPOWER  4_2012
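
The df used above is not shown in full, so here is a minimal, self-contained sketch of the same idea. The data frame df_demo and its columns are made up for illustration; only split, factor and unique come from the answer.

```r
# Toy data (hypothetical): the groups first appear in the order b, a, c.
df_demo <- data.frame(
  key   = c("b", "b", "a", "c", "a"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Default behaviour: the character column becomes a factor with sorted levels.
names(split(df_demo, df_demo$key))
# "a" "b" "c"

# Taking the levels in order of first appearance keeps the natural order.
df_demo$key <- factor(df_demo$key, levels = unique(df_demo$key))
by_appearance <- split(df_demo, df_demo$key)
names(by_appearance)
# "b" "a" "c"
```

Only the order of the list elements changes; the rows inside each group are the same in both calls.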
    

    But do not use split. Use data.table instead:

    However, split tends to become very slow as the number of levels increases, so I'd suggest using data.table to subset into a list instead. I'd expect that to be much faster.

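    require(data.table)
    dt <- data.table(df)
    # number the groups in order of first appearance
    dt[, grp := .GRP, by = yearmon]
    # sort the rows by group number
    setkey(dt, grp)
    # collect each group's rows (.SD) into a list of data.tables
    o1 <- dt[, list(list(.SD)), by = grp]$V1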
    

    Benchmarking on huge data:

    set.seed(45)
    dates <- seq(as.Date("1900-01-01"), as.Date("2013-12-31"), by = "days")
    ym <- do.call(paste, c(expand.grid(1:500, 1900:2013), sep="_"))
    
    df <- data.frame(x1 = sample(dates, 1e4, TRUE), 
                     x2 = sample(letters, 1e4, TRUE), 
                     x3 = sample(10, 1e4, TRUE), 
                     yearmon = sample(ym, 1e4, TRUE), 
                     stringsAsFactors=FALSE)
    
    require(data.table)
    dt <- data.table(df)
    
    f1 <- function(dt) {
        dt[, grp := .GRP, by = yearmon]
        setkey(dt, grp)
    
        o1 <- dt[, list(list(.SD)), by=grp]$V1
    }
    
    f2 <- function(df) {
        df$yearmon <- factor(df$yearmon, levels=unique(df$yearmon))
        o2 <- split(df, df$yearmon)
    }
    
    require(microbenchmark)
    microbenchmark(o1 <- f1(dt), o2 <- f2(df), times = 10)
    
    # Unit: milliseconds
    #          expr        min         lq     median        uq      max neval
    #  o1 <- f1(dt)   43.72995   43.85035   45.20087  715.1292 1071.976    10
    #  o2 <- f2(df) 4485.34205 4916.13633 5210.88376 5763.1667 6912.741    10
    

    Note that o1 will be an unnamed list, but you can set the names with names(o1) <- unique(dt$yearmon). This works because setkey has already sorted dt by grp, so unique returns the labels in the same order as the list elements.
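
To make the naming step concrete, here is a small sketch of the data.table route on toy data. The table dt_demo and its columns key_col and value are hypothetical; the calls themselves (.GRP, setkey, list(list(.SD))) are the ones used in the answer, and the sketch assumes the data.table package is installed.

```r
library(data.table)

# Toy data (hypothetical): the groups first appear in the order b, a, c.
dt_demo <- data.table(key_col = c("b", "b", "a", "c", "a"), value = 1:5)

# Number the groups in order of first appearance, then sort rows by that number.
dt_demo[, grp := .GRP, by = key_col]
setkey(dt_demo, grp)

# One list element per group; .SD holds every column except the `by` column.
groups <- dt_demo[, list(list(.SD)), by = grp]$V1

# The rows are now sorted by grp, so unique() returns the labels in group order.
names(groups) <- unique(dt_demo$key_col)
names(groups)
# "b" "a" "c"
```

Note that := and setkey modify dt_demo in place, so the table is left sorted by grp after these lines run.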
