De-aggregate / reverse-summarise / expand a dataset in R

一曲冷凌霜 提交于 2019-12-29 07:41:58

问题


My data looks like this:

data("Titanic")
df <- as.data.frame(Titanic)

How can I de-aggregate or reverse-summarise count/freq and expand the data set back to it's original non-count observation state?

For instance, I want 3rd, Male, Child, No repeated 35 times and 1st, Female, Adult, Yes repeated 140 times, etc, etc, in the dataframe.

Thanks in advance.


回答1:


Without packages we can repeat each row according to the frequencies given:

df2 <- df[rep(1:nrow(df), df[,5]),-5]



回答2:


You can do this with list columns and a few dplyr/tidyr/purrr verbs. It's not as compact as other base R solutions may be, but for me it's easier to understand how things fit together and it works in a larger tidyverse pipeflow.

As a check first-off, we're expecting 2,201 rows in the final data frame:

library(dplyr)
library(tidyr)
library(purrr)

sum(df$Freq)
#> [1] 2201

Converting this to a tibble makes it easier to see and work with the list column. I'm using purrr::map to move along the Freq column, creating a vector of a dummy marker the length of the Freq value. In this case, that marker is just "1"; it could instead be TRUE or anything else. The point it just that it will create a vector of the length Freq.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x)))
#> # A tibble: 32 x 6
#>    Class Sex    Age   Survived  Freq obs        
#>    <fct> <fct>  <fct> <fct>    <dbl> <list>     
#>  1 1st   Male   Child No           0 <dbl [0]>  
#>  2 2nd   Male   Child No           0 <dbl [0]>  
#>  3 3rd   Male   Child No          35 <dbl [35]> 
#>  4 Crew  Male   Child No           0 <dbl [0]>  
#>  5 1st   Female Child No           0 <dbl [0]>  
#>  6 2nd   Female Child No           0 <dbl [0]>  
#>  7 3rd   Female Child No          17 <dbl [17]> 
#>  8 Crew  Female Child No           0 <dbl [0]>  
#>  9 1st   Male   Adult No         118 <dbl [118]>
#> 10 2nd   Male   Adult No         154 <dbl [154]>
#> # … with 22 more rows

Then tidyr::unnest creates a row for each element in that dummy vector. After that, I drop those last 2 columns to just have the important categories of class, sex, age, and survival.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x))) %>%
  unnest() %>%
  select(-Freq, -obs)
#> # A tibble: 2,201 x 4
#>    Class Sex   Age   Survived
#>    <fct> <fct> <fct> <fct>   
#>  1 3rd   Male  Child No      
#>  2 3rd   Male  Child No      
#>  3 3rd   Male  Child No      
#>  4 3rd   Male  Child No      
#>  5 3rd   Male  Child No      
#>  6 3rd   Male  Child No      
#>  7 3rd   Male  Child No      
#>  8 3rd   Male  Child No      
#>  9 3rd   Male  Child No      
#> 10 3rd   Male  Child No      
#> # … with 2,191 more rows

In the end, it is in fact a data frame of 2,201 rows.




回答3:


You can do this using the function untable in reshape.

data("Titanic")
df <- as.data.frame(Titanic)

library(reshape)
newDf = untable(df[,1:4], num = df[,5])


来源:https://stackoverflow.com/questions/38208529/de-aggregate-reverse-summarise-expand-a-dataset-in-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!