Spread with data.frame/tibble with duplicate identifiers

∥☆過路亽.° 提交于 2019-11-26 03:58:02

问题


The documentation for tidyr suggests that gather and spread are transitive, but the following example with the \"iris\" data shows they are not, but it is not clear why. Any clarification would be greatly appreciated

iris.df = as.data.frame(iris)
long.iris.df = iris.df %>% gather(key = feature.measure, value = size, -Species)
w.iris.df = long.iris.df %>% spread(key = feature.measure, value = size, -Species)

I expected the data frame \"w.iris.df\" to be the same as \"iris.df\" but received the following error instead:

\"Error: Duplicate identifiers for rows (1, 2, 3, 4, 5, 6, 7, 8, 9...\"

My general question is how to reverse an application of \"gather\" on this sort of dataset.


回答1:


Hadley's intervention was unsurprisingly perfect... but I ended up mucking with the syntax a bit after that... so for what it's worth, I post the fully operational code (sorry my syntax is a bit different than above):

library(tidyr)
library(dplyr)

wide <- 
  iris %>%
  mutate(row = row_number()) %>%
  gather(vars, val, -Species, -row) %>%
  spread(vars, val)

head(wide)
#   Species row Petal.Length Petal.Width Sepal.Length Sepal.Width
# 1  setosa   1          1.4         0.2          5.1         3.5
# 2  setosa   2          1.4         0.2          4.9         3.0
# 3  setosa   3          1.3         0.2          4.7         3.2
# 4  setosa   4          1.5         0.2          4.6         3.1
# 5  setosa   5          1.4         0.2          5.0         3.6
# 6  setosa   6          1.7         0.4          5.4         3.9

head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

They are the same.... just need to reorder if u feel like it...

wide <- wide[,c(3, 4, 5, 6, 1)]  ## Reorder and then remove "row" column

and done.




回答2:


As the previous answer may not have been sufficiently clear, there's a problem with how you are performing gather that shows itself when you are trying to spread.

The issue is that in the process of gathering you are losing track of which feature.measure belong to which row in the original dataframe, so spread has no idea how to combine individual values into the the "wide" table again.

iris.df = as.data.frame(iris)
long.iris.df = iris.df %>% 
  tibble::rowid_to_column() %>% 
  gather(key = feature.measure, value = size, -Species, -rowid)

#>   rowid Species feature.measure size
#> 1     1  setosa    Sepal.Length  5.1
#> 2     2  setosa    Sepal.Length  4.9
#> 3     3  setosa    Sepal.Length  4.7
#> 4     4  setosa    Sepal.Length  4.6
#> 5     5  setosa    Sepal.Length  5.0
#> 6     6  setosa    Sepal.Length  5.4

Now each value in size preserves its rowid so you will always be able to recombine it back to the wide dataset (removing unnecessary rowid):

w.iris.df = long.iris.df %>% 
  spread(key = feature.measure, value = size) %>% 
  select(-rowid)
head(w.iris.df)
#>   Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#> 1  setosa          1.4         0.2          5.1         3.5
#> 2  setosa          1.4         0.2          4.9         3.0
#> 3  setosa          1.3         0.2          4.7         3.2
#> 4  setosa          1.5         0.2          4.6         3.1
#> 5  setosa          1.4         0.2          5.0         3.6
#> 6  setosa          1.7         0.4          5.4         3.9


来源:https://stackoverflow.com/questions/25960394/spread-with-data-frame-tibble-with-duplicate-identifiers

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!