Repeat the rows in a data frame based on values in a specific column

I would like to repeat entire rows in a data frame based on the value in the samples column.

My input:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'
df <- read.table(text=df, header=TRUE)

My expected output:

df <- 'chr start end  samples
        1   10   20   1-10-20-s1
        1   10   20   1-10-20-s2
        2   4    10   2-4-10-s1
        2   4    10   2-4-10-s2
        2   4    10   2-4-10-s3'

Any ideas on how to do this efficiently?

You can achieve this using base R (i.e. without data.table) with the following code:

df <- 'chr start end samples
        1   10   20    2
        2   4    10    3'

df <- read.table(text = df, header = TRUE)

# Build the repeated rows for a single input row; 'samples' is the repeat count
duplicate_rows <- function(chr, start, end, samples) {
  expanded_samples <- paste0(chr, "-", start, "-", end, "-s", seq_len(samples))
  # Keep the input column names (start, end) so the output matches the expected layout
  data.frame(chr = chr, start = start, end = end, samples = expanded_samples)
}

expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)

new_df <- do.call(rbind, expanded_rows)

The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
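
For comparison, here is a minimal sketch of a vectorised base R alternative to the Map/do.call pattern described above. It indexes the data frame with repeated row numbers instead of building one small data.frame per row. The names idx, counter and expanded are illustrative only, and the sketch assumes every value in samples is a non-negative whole number.

```r
df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# Repeat each row index as many times as that row's 'samples' value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), times = df$samples)

# Pull the repeated rows, leaving out the old count column
expanded <- df[idx, c("chr", "start", "end")]

# Running counter that restarts for every original row: 1 2 1 2 3
counter <- sequence(df$samples)

# Build the label column from the repeated values plus the counter
expanded$samples <- paste0(expanded$chr, "-", expanded$start, "-",
                           expanded$end, "-s", counter)

# Indexing with repeated rows yields names like "1.1"; reset them
rownames(expanded) <- NULL
expanded
```

Because rep and sequence work on whole vectors, this avoids the intermediate list of data.frames, which may matter for inputs with many rows.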

The above code can be made cleaner using Hadley Wickham's purrr package (on CRAN) and its data.frame-specific version of map (see the documentation for the by_row function), but this may be overkill for what you're after.
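
As a rough sketch of what a purrr-based version might look like, the example below uses pmap rather than by_row (a substitution on my part, assuming purrr is installed). When pmap is given a data frame, it calls the function once per row and passes the columns as named arguments, so the hypothetical helper expand_row must use the column names as its parameter names.

```r
library(purrr)

df <- data.frame(chr = c(1, 2), start = c(10, 4),
                 end = c(20, 10), samples = c(2, 3))

# Hypothetical helper: parameter names must match the column names of df
expand_row <- function(chr, start, end, samples) {
  data.frame(
    chr = chr, start = start, end = end,
    samples = paste0(chr, "-", start, "-", end, "-s", seq_len(samples))
  )
}

# pmap returns a list with one data.frame per input row; rbind stacks them
new_df <- do.call(rbind, pmap(df, expand_row))
new_df
```

The only real gain over Map is that the columns are matched by name rather than by position, so reordering the columns of df does not break the call.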

We can use expandRows to expand the rows based on the value in the 'samples' column, then convert the result to a data.table. Grouped by 'chr', we then use sprintf to paste the columns together with the row sequence and assign the result to the 'samples' column.

library(splitstackshape)
setDT(expandRows(df, "samples"))[,
     samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
#  chr start end    samples
#1:   1    10  20 1-10-20-s1
#2:   1    10  20 1-10-20-s2
#3:   2     4  10  2-4-10-s1
#4:   2     4  10  2-4-10-s2
#5:   2     4  10  2-4-10-s3

NOTE: data.table will be loaded when we load splitstackshape.

Example using the DataFrame function from the S4Vectors package:

library(S4Vectors)

df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)

where the y column gives the number of times to repeat the corresponding row.

Result:

DataFrame with 15 rows and 2 columns
              x         y
    <character> <integer>
1             a         1
2             b         2
3             b         2
4             c         3
5             c         3
...         ...       ...
11            e         5
12            e         5
13            e         5
14            e         5
15            e         5

Source: https://stackoverflow.com/questions/38499032/repeat-the-rows-in-a-data-frame-based-on-values-in-a-specific-column
