What can R do about a messy data format?

时光说笑 2020-11-28 08:27

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and

6 Answers

日久生厌 (original poster)

2020-11-28 08:58

The short answer to the question is yes, R code can clean up that mess, and no, it does not take much trouble.

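The first step after copying and pasting the table into an R session is to read it in with read.table, setting the header, sep, comment.char and strip.white arguments. header = TRUE takes the column names from the first line that is not skipped, sep = "|" splits the fields on the vertical bars, comment.char = "+" makes R ignore the border lines because they start with "+", and strip.white = TRUE trims the padding around each value.

Credit for reminding me of the comment.char and strip.white arguments goes to @nicola and his comment.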

dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)

But as you can see there are some issues with the result.

dat
   X       Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01    A   A1        0                        0  NA
2 NA 2018-06-03    A   A2        0                        1  NA
3 NA 2018-06-03    A   A3        0                        2  NA
4 NA 2018-06-03    A   A4        1                        1  NA
5 NA 2018-06-03    A   A5        2                        1  NA
6 NA 2018-06-04    A   A6        0                        3  NA
7 NA 2018-06-01    B   B1        0                        1  NA
8 NA 2018-06-02    B   B2        0                        2  NA
9 NA 2018-06-03    B   B3        0                        3  NA

Because every data row starts and ends with a separator, R treats the empty fields before the first separator and after the last one as extra columns, which is not what the OP of the original question meant.

So the second step is to keep only the real columns. I will do this by subsetting the columns by position, which is easy because the spurious columns are usually the first and the last.

dat <- dat[-c(1, ncol(dat))]
dat
        Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01    A   A1        0                        0
2 2018-06-03    A   A2        0                        1
3 2018-06-03    A   A3        0                        2
4 2018-06-03    A   A4        1                        1
5 2018-06-03    A   A5        2                        1
6 2018-06-04    A   A6        0                        3
7 2018-06-01    B   B1        0                        1
8 2018-06-02    B   B2        0                        2
9 2018-06-03    B   B3        0                        3
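Dropping columns by position assumes the empty columns are exactly the first and the last one. As a hedged alternative, the sketch below drops every column that holds nothing but NA values, wherever it sits. The small sample table and the names txt, dat2 and keep are made up for illustration and are not part of the original answer.

```r
# A small table in the same bordered style (hypothetical sample data).
txt <- "
+------------+------+
|    Date    | Emp1 |
+------------+------+
| 2018-06-01 | A    |
| 2018-06-03 | B    |
+------------+------+
"

dat2 <- read.table(text = txt, header = TRUE, sep = "|",
                   comment.char = "+", strip.white = TRUE)

# TRUE for columns with at least one non-missing value.
keep <- colSums(!is.na(dat2)) > 0

# Keep only those columns; the empty X and X.1 columns are dropped.
dat2 <- dat2[keep]
names(dat2)
```

One trade-off: a genuine column whose values are all missing would be dropped as well, so the positional approach is safer when such columns can occur.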

That was not too hard, and the result looks much better.
In this case one problem remains: the Date column still has to be coerced to class Date.

dat$Date <- as.Date(dat$Date)

And the result is satisfactory.

str(dat)
'data.frame':   9 obs. of  5 variables:
 $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...
 $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
 $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
 $ Priority                : int  0 0 0 1 2 0 0 0 0
 $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3
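The as.Date call needs no further arguments here because the dates are already written as year-month-day. For other layouts a format string has to be supplied. A minimal sketch with made-up input strings (the names iso and dmy are illustrative only):

```r
# ISO dates (yyyy-mm-dd) are parsed without a format string.
iso <- as.Date(c("2018-06-01", "2018-06-03"))

# Hypothetical day/month/year input needs an explicit format.
dmy <- as.Date(c("01/06/2018", "03/06/2018"), format = "%d/%m/%Y")

# Element-wise comparison of the two parsed vectors.
all(iso == dmy)
```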

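Note that I have not set the commonly used argument stringsAsFactors = FALSE, which is why Emp1 and Case are factors in the output above. If character columns are needed, set it in the call to read.table. Since R 4.0.0 the default is already stringsAsFactors = FALSE, so in current versions these two columns are read as character without it.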
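To illustrate the effect of that argument, here is a minimal sketch on a made-up two-row table (txt and dat3 are illustrative names). With stringsAsFactors = FALSE the text columns are read as character vectors, regardless of the R version.

```r
# Hypothetical sample in the same bordered layout.
txt <- "
+------------+------+
|    Date    | Emp1 |
+------------+------+
| 2018-06-01 | A    |
| 2018-06-03 | B    |
+------------+------+
"

dat3 <- read.table(text = txt, header = TRUE, sep = "|",
                   comment.char = "+", strip.white = TRUE,
                   stringsAsFactors = FALSE)

# Drop the empty first and last columns, as in the answer.
dat3 <- dat3[-c(1, ncol(dat3))]

# Inspect the class of each remaining column.
sapply(dat3, class)
```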

The whole process took only 3 lines of base R code.
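If this kind of table shows up often, the three steps can be wrapped in a small helper. The function below is only a sketch: the name read_bordered_table and its date_cols argument are hypothetical, and it assumes that no data value contains "+" or "|", since those characters are used as comment marker and separator.

```r
# Hypothetical helper combining the three steps from the answer.
read_bordered_table <- function(text, date_cols = character(0)) {
  # Step 1: read the table, skipping the "+---+" border lines.
  dat <- read.table(text = text, header = TRUE, sep = "|",
                    comment.char = "+", strip.white = TRUE,
                    stringsAsFactors = FALSE)
  # Step 2: drop the empty first and last columns.
  dat <- dat[-c(1, ncol(dat))]
  # Step 3: coerce the requested columns to class Date.
  for (col in date_cols) {
    dat[[col]] <- as.Date(dat[[col]])
  }
  dat
}

# Usage on a made-up sample table.
txt <- "
+------------+------+----------+
|    Date    | Emp1 | Priority |
+------------+------+----------+
| 2018-06-01 | A    |        0 |
| 2018-06-03 | B    |        1 |
+------------+------+----------+
"
res <- read_bordered_table(txt, date_cols = "Date")
str(res)
```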

Finally, here is the end result in dput format, the way the data should have been posted in the first place.

dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685, 
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A", 
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2", 
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0, 
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1, 
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")
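For completeness, a short sketch of how such output is produced: dput prints R code that rebuilds the object, so it can be pasted straight into a question. The data frame small and the name rebuilt are made up for illustration.

```r
# Hypothetical two-row data frame.
small <- data.frame(Date = as.Date(c("2018-06-01", "2018-06-03")),
                    Emp1 = c("A", "B"),
                    stringsAsFactors = FALSE)

# Print the code that recreates the object.
dput(small)

# Round trip: evaluate the deparsed code and compare with the original.
rebuilt <- eval(parse(text = deparse(small)))
all.equal(rebuilt, small)
```

For large data sets, dput(head(dat, 20)) keeps the posted sample short.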
