Repeat the rows in a data frame based on values in a specific column [duplicate]

≯℡__Kan透↙ submitted on 2019-11-29 15:15:57

You can achieve this using base R (i.e. without data.table), with the following code:

# Example input: 'samples' gives the number of copies wanted for each row
df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'

df <- read.table(text = df, header = TRUE)

# Expand a single row into 'samples' rows, each with a distinct label
duplicate_rows <- function(chr, start, end, samples) {
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", seq_len(samples))
  data.frame(chr = chr, start = start, end = end, samples = expanded_samples)
}

# Apply the function row by row; the result is a list of data.frames
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)

# Bind the list back into a single data.frame
new_df <- do.call(rbind, expanded_rows)

The basic idea is to define a function that takes a single row of your initial data.frame and duplicates it according to the value in the samples column (also creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames, which is then recombined into a single data.frame using the do.call(rbind, ...) pattern.
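For comparison, the same result can be reached without a per-row function by repeating row indices. This is a different technique from the Map/do.call approach above, shown only as a minimal sketch; the names idx and counter are illustrative, and the input is rebuilt with data.frame() so the snippet stands alone.

```r
# Same input as above, built directly
df <- data.frame(chr = c(1L, 2L), start = c(10L, 4L),
                 end = c(20L, 10L), samples = c(2L, 3L))

# Repeat each row index as many times as its 'samples' value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), times = df$samples)
new_df <- df[idx, ]

# Running counter within each original row: 1 2 1 2 3
counter <- sequence(df$samples)

# Overwrite 'samples' with the chr-start-end-sN labels
new_df$samples <- paste0(new_df$chr, "-", new_df$start, "-",
                         new_df$end, "-s", counter)

# Reset the row names left over from the indexing step
rownames(new_df) <- NULL
```

Because rep() and sequence() are vectorised, this avoids building and binding a list of small data.frames, which may matter for larger inputs.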

The above code can be made cleaner with Hadley Wickham's purrr package (on CRAN) and its data.frame-specific version of map (see the documentation for the by_row function, which has since moved to the purrrlyr package), but this may be overkill for what you're after.
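As a rough sketch of what a purrr version might look like, the snippet below uses pmap() rather than by_row, so it is a substitute for the function mentioned above and not the same call. The helper name expand_row is hypothetical, and the sketch assumes every samples value is at least 1.

```r
library(purrr)

df <- data.frame(chr = c(1L, 2L), start = c(10L, 4L),
                 end = c(20L, 10L), samples = c(2L, 3L))

# Hypothetical helper: expands one row into 'samples' labelled rows
# (assumes samples >= 1)
expand_row <- function(chr, start, end, samples) {
  data.frame(chr = chr, start = start, end = end,
             samples = paste0(chr, "-", start, "-", end, "-s",
                              seq_len(samples)))
}

# pmap() walks the columns of df in parallel and matches them
# to the arguments of expand_row by name
expanded <- pmap(df, expand_row)

# Combine the list of data.frames into one
new_df <- do.call(rbind, expanded)
```

Since pmap() matches columns to arguments by name, the function's parameter names have to agree with the column names of df.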

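We can use expandRows (from the splitstackshape package) to expand the rows based on the value in the 'samples' column, then convert the result to a data.table. Grouped by 'chr', we use sprintf to paste the columns together with the within-group row sequence and assign the result to the 'samples' column. Note that grouping by 'chr' assumes each original row has a distinct 'chr' value.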

library(splitstackshape)
setDT(expandRows(df, "samples"))[,
     samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
#  chr start end    samples
#1:   1    10  20 1-10-20-s1
#2:   1    10  20 1-10-20-s2
#3:   2     4  10  2-4-10-s1
#4:   2     4  10  2-4-10-s2
#5:   2     4  10  2-4-10-s3

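NOTE: data.table is installed as a dependency of splitstackshape. If setDT is not found after library(splitstackshape), attach it explicitly with library(data.table).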

Example using the DataFrame function from the S4Vectors package (Bioconductor):

library(S4Vectors)
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)

where the y column gives the number of times to repeat the corresponding row.

Result:

DataFrame with 15 rows and 2 columns
              x         y
    <character> <integer>
1             a         1
2             b         2
3             b         2
4             c         3
5             c         3
...         ...       ...
11            e         5
12            e         5
13            e         5
14            e         5
15            e         5