read/write data in libsvm format

慢半拍i 2020-11-30 11:05

How do I read/write libsvm data into/from R?

The libsvm format is sparse data like

[ 

        
7 Answers
  •  不知归路
    2020-11-30 11:51

    I came up with my own ad hoc solution leveraging some data.table utilities.

    It ran in almost no time on the test data set I found (Boston Housing data).

    Converting that to a data.table (orthogonal to solution, but adding here for easy reproducibility):

    library(data.table)
    x = fread("/media/data_drive/housing.data.fw",
              sep = "\n", header = FALSE)
    #usually fixed-width conversion is harder, but everything here is numeric
    columns =  c("CRIM", "ZN", "INDUS", "CHAS",
                 "NOX", "RM", "AGE", "DIS", "RAD", 
                 "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
    DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
                       header = FALSE, sep = "\t",
                       col.names = columns))
    

    Here it is:

    DT[ , fwrite(as.data.table(paste0(
      MEDV, " | ", sapply(transpose(lapply(
        names(.SD), function(jj)
          paste0(jj, ":", get(jj)))),
        paste, collapse = " "))), 
      "/path/to/output", col.names = FALSE, quote = FALSE),
      .SDcols = !"MEDV"]
    #what gets sent to as.data.table:
    #(MEDV is excluded from the features by .SDcols, so it appears only as the target)
    #[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575 
    #  AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98"      
    #[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421 
    #  AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14"
    # ...
    

    There may be a better way to get this understood by fwrite than as.data.table, but I can't think of one (until setDT works on vectors).

    I replicated this to test its performance on a bigger data set (just blow up the current data set):

    DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))
    

    The operation was pretty fast compared to some of the times reported here (I haven't bothered comparing directly yet):

    system.time(.)
    #    user  system elapsed 
    #   8.392   0.000   8.385 
    

    I also tested using writeLines instead of fwrite, but the latter was better.
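
    For reference, the two writers can be compared on a tiny example. This is only a sketch: the lines and the temporary file paths below are made up for illustration, and it relies on fwrite accepting a plain list of equal-length vectors, which may be a way around the as.data.table step.

```r
library(data.table)

# hypothetical lines, already in the "target | name:value ..." layout
toy_lines = c("24 | a:0.5 b:18", "21.6 | a:0 b:7")

# temporary paths, for illustration only
out_base = tempfile(fileext = ".txt")
out_dt   = tempfile(fileext = ".txt")

# base R: one vector element per output line
writeLines(toy_lines, out_base)

# data.table: pass the vector as a one-column list; no header, no quoting
fwrite(list(toy_lines), out_dt, col.names = FALSE, quote = FALSE)

# both files should contain the same lines
same = identical(readLines(out_base), readLines(out_dt))
print(same)
```

    Wrapping either call in system.time() on the blown-up data is how the two could be timed against each other.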


    Looking at this again, it may take a while to figure out what's going on. Maybe the magrittr-piped version will be easier to follow:

    library(magrittr) #provides %>%
    DT[ , 
        #1) prepend each column's values with the column name
        lapply(names(.SD), function(jj)
          paste0(jj, ":", get(jj))) %>%
          #2) transpose this list (using data.table's fast tool)
          #   (was column-wise, now row-wise)
          #3) concatenate columns, separated by " "
          transpose %>% sapply(paste, collapse = " ") %>%
          #4) prepend each row with the target value
          #   (with Vowpal Wabbit in mind, separate with a pipe)
          paste0(MEDV, " | ", .) %>%
          #5) convert this to a data.table to use fwrite
          as.data.table %>%
          #6) fwrite it; exclude nonsense column name,
          #   and force quotes off
          fwrite("/path/to/data", 
                 col.names = FALSE, quote = FALSE),
      .SDcols = !"MEDV"]
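
    The line-building steps can also be checked without writing any file. The sketch below uses a made-up two-row table (the names toy, y, a and b are hypothetical; y stands in for MEDV) and stores each intermediate result so it can be printed and inspected.

```r
library(data.table)

# hypothetical toy data; "y" plays the role of the target column
toy = data.table(y = c(24, 21.6), a = c(0.5, 0), b = c(18L, 7L))
features = setdiff(names(toy), "y")

# 1) prefix every value with its column name (one character vector per column)
pieces = lapply(features, function(jj) paste0(jj, ":", toy[[jj]]))

# 2) transpose the list: one character vector per row instead of per column
by_row = transpose(pieces)

# 3) collapse each row's pieces with " " and prepend the target and a pipe
lines = paste0(toy$y, " | ", sapply(by_row, paste, collapse = " "))
print(lines)
```

    Printing pieces and by_row side by side shows what transpose does here: the list goes from column-wise to row-wise.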
    

    Reading in such files is much easier:**

    #quickly read data; don't split within lines
    x = fread("/path/to/data", sep = "\n", header = FALSE)
    
    #tstrsplit is transpose(strsplit(.))
    dt1 = x[ , tstrsplit(V1, split = "[| :]+")]
    
    #even columns have variable names
    nms = c("target_name", 
            unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), 
                       with = FALSE]))
    
    #odd columns have values
    DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
    #add meaningful names
    setnames(DT, nms)
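
    One thing to keep in mind is that tstrsplit returns character columns by default, so the values still need converting. A small sketch on made-up input (the two lines and the names res and target are hypothetical, and no file is read):

```r
library(data.table)

# hypothetical lines in the "target | name:value ..." layout
x = data.table(V1 = c("24 | a:0.5 b:18", "21.6 | a:0 b:7"))

# split on runs of "|", " " or ":"; values and names then alternate
dt1 = x[ , tstrsplit(V1, split = "[| :]+")]

# even columns hold the feature names (taken from the first row)
nms = c("target",
        unname(unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L), with = FALSE])))

# odd columns hold the target and the feature values
res = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
setnames(res, nms)

# everything is still character at this point; convert to numeric
res[ , (nms) := lapply(.SD, as.numeric)]
print(res)
```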
    

    **this will not work with "ragged"/sparse input data. I don't think there's a way to extend this to work in such cases.
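
    A different route that might cope with sparse lines (not part of the approach above, and only a sketch) is to go through a long table of (row, feature, value) triplets and widen it with dcast. The input lines below are made up, the "label index:value" layout is assumed, and every line is assumed to carry at least one feature.

```r
library(data.table)

# hypothetical sparse lines: "label index:value index:value ..."
sparse_lines = c("1 1:0.5 3:2", "-1 2:1.5", "1 1:0.1 2:0.2 3:0.3")

# split on whitespace; the first token of each line is the label
tokens = strsplit(sparse_lines, "\\s+")
labels = as.numeric(vapply(tokens, `[`, "", 1L))

# build one long table of (row, feature, value) triplets
long = rbindlist(lapply(seq_along(tokens), function(i) {
  kv = tstrsplit(tokens[[i]][-1L], ":", fixed = TRUE)
  data.table(row = i, feature = kv[[1L]], value = as.numeric(kv[[2L]]))
}))

# widen to one column per feature; entries absent from a line become 0
wide = dcast(long, row ~ feature, value.var = "value", fill = 0)
wide[ , label := labels[row]]
print(wide)
```

    This trades the speed of the single tstrsplit call for a per-line loop, so it is unlikely to be as fast on large files.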
