Get the number of lines in a text file using R

伪装坚强ぢ 2020-12-05 10:51

Is there a way to get the number of lines in a file without importing it?

So far, this is what I am doing:

myfiles <- list.files(pattern="*.dat")
        
5 Answers
  •  无人及你
    2020-12-05 11:13

    If you:

    • still want to avoid the system call that a system2("wc"… will cause
    • are on BSD/Linux or OS X (I didn't test the following on Windows)
    • don't mind using a full file path
    • are comfortable using the inline package

    then the following should be about as fast as you can get (it's pretty much the 'line count' portion of wc in an inline R C function):

    library(inline)
    
    wc.code <- "
    uintmax_t linect = 0; 
    uintmax_t tlinect = 0;
    
    int fd, len;
    u_char *p;
    
    struct statfs fsb;
    
    static off_t buf_size = SMALL_BUF_SIZE;
    static u_char small_buf[SMALL_BUF_SIZE];
    static u_char *buf = small_buf;
    
    PROTECT(f = AS_CHARACTER(f));
    
    if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {
    
      if (fstatfs(fd, &fsb)) {
        fsb.f_iosize = SMALL_BUF_SIZE;
      }
    
      if (fsb.f_iosize != buf_size) {
        if (buf != small_buf) {
          free(buf);
        }
        if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
          buf = small_buf;
          buf_size = SMALL_BUF_SIZE;
        } else {
          buf_size = fsb.f_iosize;
        }
      }
    
      while ((len = read(fd, buf, buf_size))) {
    
        if (len == -1) {
          (void)close(fd);
          break;
        }
    
        for (p = buf; len--; ++p)
          if (*p == '\\n')
            ++linect;
      }
    
      tlinect += linect;
    
      (void)close(fd);
    
    }
    SEXP result;
    PROTECT(result = NEW_INTEGER(1));
    INTEGER(result)[0] = tlinect;
    UNPROTECT(2);
    return(result);
    ";
    
    setCMethod("wc",
               signature(f="character"), 
               wc.code,
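               includes=c(# The header names were lost in this copy. The list below is an
                          # assumption: the headers the C code above needs on BSD/OS X.
                          "#include <sys/types.h>",  # u_char, off_t
                          "#include <sys/param.h>",  # needed before sys/mount.h
                          "#include <sys/mount.h>",  # fstatfs(), struct statfs
                          "#include <fcntl.h>",      # open(), O_RDONLY
                          "#include <unistd.h>",     # read(), close()
                          "#include <stdlib.h>",     # malloc(), free()
                          "#include <stdint.h>",     # uintmax_t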
                          "#define SMALL_BUF_SIZE (1024 * 8)"),
               language="C",
               convention=".Call")
    
    wc("FULLPATHTOFILE")
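
For comparison, the same idea (read the file in fixed-size chunks and count newline bytes) can be sketched in base R, which avoids the compile step and should also run on Windows. This is only an illustration, not part of the answer's code: `count_newlines` and `chunk_size` are hypothetical names, and it will be slower than the compiled version.

```r
# Hypothetical helper: count '\n' bytes by reading the file in fixed-size
# chunks, mirroring the read-and-scan loop of the C code (base R only).
count_newlines <- function(path, chunk_size = 8192L) {
  con <- file(path, open = "rb")   # binary mode: no parsing, no encoding work
  on.exit(close(con))
  newline <- as.raw(10L)           # byte value of '\n'
  total <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break # end of file
    total <- total + sum(chunk == newline)
  }
  total
}

# Usage on a throwaway file with a known number of lines
tmp <- tempfile(fileext = ".dat")
writeLines(c("a", "b", "c"), tmp)
n_lines <- count_newlines(tmp)
print(n_lines)
unlink(tmp)
```

Like `wc -l`, this counts newline characters, so a final line without a trailing newline is not counted. For the list of files in the question, something like `sapply(myfiles, count_newlines)` would give one count per file.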
    

    It'd be better as a package since it actually has to compile the first time through. But, it's here for reference if you really do need "speed". For a 189,955 line file I had lying around, I get (mean values from a bunch of runs):

       user  system elapsed 
      0.007   0.003   0.010 
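
Averaged timings like these could be collected with `system.time()`. The following is a rough sketch under assumptions: `mean_timing` is a hypothetical helper, and the workload is a stand-in (`length(readLines(...))` on a small temporary file) so that the block runs without the compiled `wc()`.

```r
# Hypothetical helper: call a zero-argument function n times and average
# the user / system / elapsed values reported by system.time().
mean_timing <- function(fun, n = 20L) {
  runs <- replicate(n, system.time(fun())[1:3])  # 3 x n matrix
  rowMeans(runs)
}

tmp <- tempfile(fileext = ".dat")
writeLines(rep("x", 1000L), tmp)

# Stand-in workload; swap in wc("FULLPATHTOFILE") once it has compiled
timing <- mean_timing(function() length(readLines(tmp)))
print(round(timing, 3))
unlink(tmp)
```

Running the function once before timing keeps the one-off compilation cost out of the average.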
    
