Search multiple columns for string to set indicator variable

前端 未结 3 1862
一整个雨季
一整个雨季 2021-01-23 13:57

I am using R and RStudio for the first time to work with a very large dataset (15 million cases) with many columns of data. To facilitate analysis, I need to search a range of

3条回答
  •  日久生厌
    2021-01-23 14:26

    In base R, we can implement a reusable function, in only a few lines of code, but requires some knowledge about which functions to use and how.

    I'll call the function bag, as in bag-of-words.

    bag <- function(..., prefix=".", levels=NULL, `NA`=NULL) {
    
      # Go from multiple columns to list of vectors
      bags <- mapply(c, ..., SIMPLIFY = FALSE, USE.NAMES = FALSE)
    
      # Find unique levels
      if(is.null(levels)) {
        levels <- sort(Reduce(union, bags))
    
        # names persist through outer
        names(levels) <- paste0(prefix, levels)
      }
    
      # Calculate out[level,bag] = level %in% bag 
      out <- outer(levels, bags, Vectorize(`%in%`))
    
      # Output a data structure
      structure(+t(out), class='bag', levels=levels)
    }
    

    Which would let us do:

    with(df1, bag(Dx1, Dx2, Dx3, prefix="Var"))
    #>      Var001 Var231 Var234 Var245 Var444 Var456 Var777
    #> [1,]      1      0      1      0      0      1      0
    #> [2,]      1      1      0      0      1      0      0
    #> [3,]      1      0      0      1      0      0      1
    #> attr(,"class")
    #> [1] "bag"
    #> attr(,"levels")
    #> Var001 Var231 Var234 Var245 Var444 Var456 Var777 
    #>  "001"  "231"  "234"  "245"  "444"  "456"  "777"
    

    This is probably not very performant, but it works. I've changed the output format from logical to numeric and included some metadata to make it easier to use in a model. We can add a function to enable modeling with bag directly:

    #' @export
    makepredictcall.bag <- function(var, call){
      # Stolen from splines package
      if (as.character(call)[1L] != "bag")
        return(call)
      args <- c("prefix", "levels")
    
    
      at <- attributes(var)[args]
      xxx <- call
      xxx[args] <- NULL
      xxx[names(at)] <- at
      xxx
    }
    

    Now, you can use it directly in a model formula. This has the advantage that the dummy coding is now incorporated in to the model and you won't need to preprocess when predicting on new data sets. Example:

    df2 <- as.data.frame(lapply(df1, sample, 20, TRUE), stringsAsFactors = FALSE)
    df3 <- as.data.frame(lapply(df1, sample, 20, TRUE), stringsAsFactors = FALSE)
    
    Y <- 1:nrow(df2)
    m <- lm(Y~bag(Dx1, Dx2, Dx3), df2)
    summary(m)
    #> 
    #> Call:
    #> lm(formula = Y ~ bag(Dx1, Dx2, Dx3), data = df2)
    #> 
    #> Residuals:
    #>     Min      1Q  Median      3Q     Max 
    #> -8.1110 -3.6765  0.1948  3.1899  8.7961 
    #> 
    #> Coefficients:
    #>                        Estimate Std. Error t value Pr(>|t|)
    #> (Intercept)             16.6709    10.3948   1.604    0.135
    #> bag(Dx1, Dx2, Dx3).001  -3.7385     5.6141  -0.666    0.518
    #> bag(Dx1, Dx2, Dx3).231  -3.7286     4.1728  -0.894    0.389
    #> bag(Dx1, Dx2, Dx3).234   3.1786     4.6528   0.683    0.507
    #> bag(Dx1, Dx2, Dx3).245  -7.2493     4.4900  -1.615    0.132
    #> bag(Dx1, Dx2, Dx3).444  -2.2936     4.3033  -0.533    0.604
    #> bag(Dx1, Dx2, Dx3).456   2.9979     4.3826   0.684    0.507
    #> bag(Dx1, Dx2, Dx3).777  -0.8608     4.5353  -0.190    0.853
    #> 
    #> Residual standard error: 5.971 on 12 degrees of freedom
    #> Multiple R-squared:  0.3566, Adjusted R-squared:  -0.01874 
    #> F-statistic: 0.9501 on 7 and 12 DF,  p-value: 0.5056
    predict(m, df3)
    #>         1         2         3         4         5         6         7 
    #>  8.681003 16.111016  4.822329 15.079445 19.108899 10.306611 13.817465 
    #>         8         9        10        11        12        13        14 
    #> 16.111016  9.788011 12.382454  9.778103  3.389569 12.382454  9.203882 
    #>        15        16        17        18        19        20 
    #> 13.817465  9.788011 12.071654  6.267249 13.827373 15.069537
    

    Created on 2019-08-06 by the reprex package (v0.3.0)

    EDIT:

    And some benchmarks for comparison

    microbenchmark::microbenchmark(mtab = mtabulate(as.data.frame(t(df1)))!=0,
                                   lapply = lapply(as.character(unique(melt(df1, id.vars = NULL)$value)), 
                                                   function(x) rowSums(df1==x) > 0),
                                   bag = do.call(bag, df1))
    #> Unit: microseconds
    #>    expr     min      lq     mean   median       uq      max neval
    #>    mtab 439.320 452.107 519.9429 462.9035 511.8710 1960.582   100
    #>  lapply 276.914 295.976 337.6020 300.7870 315.0135 2268.210   100
    #>     bag 121.996 130.305 146.6677 139.6990 145.3275  294.711   100
    

提交回复
热议问题