What can R do about a messy data format?

前端 未结 6 1828
时光说笑
时光说笑 2020-11-28 08:27

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and

6条回答
  •  日久生厌
    2020-11-28 08:58

    The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

    The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.

    Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

    dat <- read.table(text = "
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)
    

    But as you can see there are some issues with the result.

    dat
       X       Date Emp1 Case Priority PriorityCountinLast7days X.1
    1 NA 2018-06-01    A   A1        0                        0  NA
    2 NA 2018-06-03    A   A2        0                        1  NA
    3 NA 2018-06-03    A   A3        0                        2  NA
    4 NA 2018-06-03    A   A4        1                        1  NA
    5 NA 2018-06-03    A   A5        2                        1  NA
    6 NA 2018-06-04    A   A6        0                        3  NA
    7 NA 2018-06-01    B   B1        0                        1  NA
    8 NA 2018-06-02    B   B2        0                        2  NA
    9 NA 2018-06-03    B   B3        0                        3  NA
    

    To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

    So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

    dat <- dat[-c(1, ncol(dat))]
    dat
              Date   Emp1   Case Priority PriorityCountinLast7days
    1  2018-06-01   A      A1           0                        0
    2  2018-06-03   A      A2           0                        1
    3  2018-06-03   A      A3           0                        2
    4  2018-06-03   A      A4           1                        1
    5  2018-06-03   A      A5           2                        1
    6  2018-06-04   A      A6           0                        3
    7  2018-06-01   B      B1           0                        1
    8  2018-06-02   B      B2           0                        2
    9  2018-06-03   B      B3           0                        3
    

    That wasn't too hard, much better.
    In this case there is still a problem, to coerce column Date to class Date.

    dat$Date <- as.Date(dat$Date)
    

    And the result is satisfactory.

    str(dat)
    'data.frame':   9 obs. of  5 variables:
     $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...
     $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
     $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
     $ Priority                : int  0 0 0 1 2 0 0 0 0
     $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3
    

    Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

    The whole process took only 3 lines of base R code.

    Finally, the end result in dput format, like it should be in the first place.

    dat <-
    structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
    17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
    "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
    "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
    0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
    1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")
    

提交回复
热议问题