How to read data with different separators?

后端 未结 3 1361
夕颜
夕颜 2020-12-14 12:03

I have a file looks like:

a 1,2,3,5
b 4,5,6,7
c 5,6,7,8
...

That the separator between 1st and 2nd is \'\\t\', other separators are comma.

3条回答
  •  南方客
    南方客 (楼主)
    2020-12-14 12:52

    "Balanced" data

    Judging by the way you've phrased your question, it seems that you know that your data are "balanced" (rectangular).

    Are you looking for speedier options? You might want to combine fread from "data.table" with my experimental concat.split.DT function.

    The solution would look something like (replace " " with "\t" for a tab):

    concat.split.DT(fread("yourfile.txt", sep = " ", header=FALSE), "V2", ",")
    

    Let's make up some data:

    x <- c("a\t1,2,3,5", "b\t4,5,6,7","c\t5,6,7,8")
    X <- c(replicate(10000, x))
    temp <- tempfile()
    writeLines(X, temp, sep="\n") ## Write it to a temporary file
    

    Josh's answer:

    system.time(out1 <- read.table(text = gsub(",", "\t", readLines(temp))))
    #    user  system elapsed 
    #   0.679   0.000   0.676 
    head(out1)
    #   V1 V2 V3 V4 V5
    # 1  a  1  2  3  5
    # 2  b  4  5  6  7
    # 3  c  5  6  7  8
    # 4  a  1  2  3  5
    # 5  b  4  5  6  7
    # 6  c  5  6  7  8
    dim(out1)
    # [1] 30000     5
    

    fread + concat.split.DT (which is like using fread twice, but is still super fast):

    system.time(out2 <- concat.split.DT(fread(temp, sep = "\t", header=FALSE), "V2", ","))
    #    user  system elapsed 
    #   0.027   0.000   0.028 
    head(out2)
    #    V1 V2_1 V2_2 V2_3 V2_4
    # 1:  a    1    2    3    5
    # 2:  b    4    5    6    7
    # 3:  c    5    6    7    8
    # 4:  a    1    2    3    5
    # 5:  b    4    5    6    7
    # 6:  c    5    6    7    8
    dim(out2)
    # [1] 30000     5
    

    "Unbalanced" data

    Although it doesn't apply to your problem, I should mention this for the benefit of others who might need to solve a similar problem:

    One limitation of the above is that concat.split.DT only handles "balanced" data. fread doesn't have a fill argument like read.table does (and I seem to remember reading somewhere that it most likely won't have such an argument).

    Here's an example of what I mean by unbalanced:

    x2 <- c("a\t1,2,3,5,6,7", "b\t4,5,6,7","c\t5,6,7,8,9,10,11,12,13")
    X2 <- c(replicate(10000, x2))
    temp2 <- tempfile()
    writeLines(X2, temp2, sep="\n")
    

    read.table can handle that with the fill = TRUE argument:

    system.time(out1b <- read.table(text = gsub(",", "\t", readLines(temp2)), fill=TRUE))
    #    user  system elapsed 
    #   1.151   0.000   1.152 
    head(out1b)
    #   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
    # 1  a  1  2  3  5  6  7 NA NA  NA
    # 2  b  4  5  6  7 NA NA NA NA  NA
    # 3  c  5  6  7  8  9 10 11 12  13
    # 4  a  1  2  3  5  6  7 NA NA  NA
    # 5  b  4  5  6  7 NA NA NA NA  NA
    # 6  c  5  6  7  8  9 10 11 12  13
    

    concat.split.DT will give you a nasty error in such cases, but you can try my cSplit function instead. It's not nearly as fast, but still performs decently:

    system.time(out2b <- cSplit(fread(temp2, sep = "\t", header=FALSE), "V2", ","))
    #    user  system elapsed 
    #   0.393   0.004   0.399 
    head(out2b)
    #    V1 V2_1 V2_2 V2_3 V2_4 V2_5 V2_6 V2_7 V2_8 V2_9
    # 1:  a    1    2    3    5    6    7   NA   NA   NA
    # 2:  b    4    5    6    7   NA   NA   NA   NA   NA
    # 3:  c    5    6    7    8    9   10   11   12   13
    # 4:  a    1    2    3    5    6    7   NA   NA   NA
    # 5:  b    4    5    6    7   NA   NA   NA   NA   NA
    # 6:  c    5    6    7    8    9   10   11   12   13
    

提交回复
热议问题