Transposing rows into columns, then split them

心不动则不痛 提交于 2019-12-24 07:59:44

问题


I have a data set containing data sorted in rows like this:

*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*

I want it to be that the VarNames become individual columns. I have achieved this by using the following code:

DFP1 <- as.data.frame(t(DFP)) #DFP contains the data

Now, this is a very big data set. It contains multiple years (millions of rows) of data. Above code creates a dataframe which has > 1E6 columns. I need to split these columns by each entry. I saw that in the first piece of data, a new entry recurs at every 86th column. So, I tried this:

tmp <- data.frame(
       X = DFP$noFloat,
       ind = rep(1:86, nrow(DFP)/86)
)

y <- rbind(DFP$nmlVar[1:86], unstack(tmp, X~ind))

This works for a few rows. The problem is that the number of variables increased over the years and that I cannot simply assume that the number of variables per entry are the same. This results in variable values mismatching it's names. I am looking for a way to match variables and values based on their variable names.

I am new to advanced data-analysis, so please let me know if you need anything more.

EDIT: I created some sample data of how DFP looks like, to hopefully make you better understand my question:

DFP <- data.frame(
    nmlVar = c("Batch", "Mass", "Length", "Product","Batch", "Mass", 
    "Length", "Product", "Batch", "Mass", "Length", "Width", "Product"),

    noFloat = c(254578, 20, 24, 24547, 254579, 23, 24, 24547, 254580, 20, 
    24, 19, 24547)
)

Important to note here is the apperance of new variable width in the third recurrence. This is typical for my dataset, introduction of new variables. The key indicator here is batch and it should be split at each time the variable batch appears.

dput output of sample data:

structure(list(nmlVar = structure(c(1L, 3L, 2L, 4L, 1L, 3L, 2L, 
4L, 1L, 3L, 2L, 5L, 4L), .Label = c("Batch", "Length", "Mass", 
"Product", "Width"), class = "factor"), noFloat = c(254578, 20, 
24, 24547, 254579, 23, 24, 24547, 254580, 20, 24, 19, 24547)), .Names = c("nmlVar", 
"noFloat"), row.names = c(NA, -13L), class = "data.frame")

回答1:


Is this what you are after?:

library(dplyr)
library(tidyr)
DFP %>% 
  mutate(sample = cumsum(nmlVar == 'Batch')) %>% 
  spread(nmlVar, noFloat)

Gives:

  sample  Batch Length Mass Product Width
1      1 254578     24   20   24547    NA
2      2 254579     24   23   24547    NA
3      3 254580     24   20   24547    19


来源:https://stackoverflow.com/questions/46522166/transposing-rows-into-columns-then-split-them

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!