Concatenating data frame rows based on column condition

Question

For the discussion that follows, I will refer to my example data frame (its first 20 rows are given as dput() output at the end of this question).

What I want is to group the rows whose packet times belong together, i.e. all the packets around 7 s, all those around 12 s, and so on. For each group, the PacketTime field should hold the spread of the times (max(PacketTime) - min(PacketTime)), and the FrameLen, IPLen and TCPLen fields should be lists of all the values in that group. For example, for the 7 s group, FrameLen would contain c(304, 276, 276).

My solution for the above is as follows:

library(dplyr)

# Group packets by their rounded timestamp and collapse each group to one row
df <- packets %>%
  group_by(round(PacketTime)) %>%
  summarise(
    PTime = max(PacketTime) - min(PacketTime),  # time span within the group
    FLen = list(FrameLen),
    ILen = list(IPLen),
    Movement = 0
  ) %>%
  rename(PacketTime = PTime, FrameLen = FLen, IPLen = ILen)
df$"round(PacketTime)" <- NULL # Remove the grouping column

However, some of these groups cross over a second boundary (e.g. the 1480 s group also includes packets that land in 1481 s), so rounding splits them. What makes this a little easier is that consecutive groups are separated by a 5 s timing window (the sender calls Python's time.sleep(5) between them).
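To make the crossover concrete, here is a small sketch with made-up timestamps (they are hypothetical and not taken from the capture). It shows a single burst that round() splits into two groups.

```r
# Hypothetical timestamps for one burst that straddles a rounding boundary
burst <- c(1480.45, 1480.49, 1480.52)

# round() sends the last packet to a different second than the first two
rounded <- round(burst)
print(rounded)

# Number of distinct groups produced for what is really one burst
n_groups <- length(unique(rounded))
print(n_groups)
```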

How can I get the same result as above, but with the grouping based only on the 5 s gap between groups, so that the crossover is handled correctly?

EDIT: As suggested by Ben, here is the dput() of my dataframe df[1:20,]:

structure(list(PacketTime = c(7.083779, 7.147268, 7.147462, 12.084768, 
12.153246, 12.153951, 17.095972, 17.159268, 17.159876, 22.11384, 
22.176926, 22.177467, 27.134427, 27.199108, 27.200064, 32.144442, 
32.208648, 32.20922, 37.144255, 37.205622), FrameLen = c(304L, 
276L, 276L, 304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L, 276L, 
304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L), IPLen = c(300L, 
272L, 272L, 300L, 272L, 272L, 300L, 272L, 272L, 300L, 272L, 272L, 
300L, 272L, 272L, 300L, 272L, 272L, 300L, 272L), TCPLen = c(260L, 
232L, 232L, 260L, 232L, 232L, 260L, 232L, 232L, 260L, 232L, 232L, 
260L, 232L, 232L, 260L, 232L, 232L, 260L, 232L), Movement = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
20L), class = "data.frame")

Answer 1:

One approach is to use seq() and cut(). Create a sequence of break points from your minimum time to just past your maximum time, spaced 5 seconds apart. Then use cut() to assign each time to an interval. If you omit the labels argument, cut() labels each group with the interval itself (for example [7,12) when right = FALSE); alternatively, label each group with the lower bound of its interval (7), as done below.

library(tidyverse)

# Break points every 5 seconds, from the first whole second to past the last packet
my_breaks <- seq(trunc(min(packets$PacketTime)), max(packets$PacketTime) + 5, 5)

# Assign each packet to a left-closed interval, labelled by its lower bound
packets$Interval <- cut(
  packets$PacketTime,
  breaks = my_breaks,
  labels = my_breaks[-length(my_breaks)],
  right = FALSE
)

packets %>%
  group_by(Interval) %>%
  summarise(
    PTime = max(PacketTime) - min(PacketTime),
    FLen = list(FrameLen),
    ILen = list(IPLen),
    Movement = 0
  ) %>%
  rename(PacketTime = PTime, FrameLen = FLen, IPLen = ILen)

Output

# A tibble: 7 x 5
  Interval PacketTime FrameLen  IPLen     Movement
  <fct>         <dbl> <list>    <list>       <dbl>
1 7            0.0637 <int [3]> <int [3]>        0
2 12           0.0692 <int [3]> <int [3]>        0
3 17           0.0639 <int [3]> <int [3]>        0
4 22           0.0636 <int [3]> <int [3]>        0
5 27           0.0656 <int [3]> <int [3]>        0
6 32           0.0648 <int [3]> <int [3]>        0
7 37           0.0614 <int [2]> <int [2]>        0
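Fixed 5 s bins assume that the bursts stay aligned with the break points. As a hedged alternative (not part of the answer above), one could instead start a new group whenever the gap to the previous packet exceeds a threshold. The sketch below uses only base R; the names gap_threshold and burst_id and the 2.5 s cutoff are assumptions chosen for illustration, and the values are the first nine rows from the question's dput().

```r
# First nine PacketTime and FrameLen values from the dput() in the question
times <- c(7.083779, 7.147268, 7.147462,
           12.084768, 12.153246, 12.153951,
           17.095972, 17.159268, 17.159876)
frame_len <- c(304L, 276L, 276L, 304L, 276L, 276L, 304L, 276L, 276L)

# Assumed cutoff: a gap longer than this many seconds starts a new burst
gap_threshold <- 2.5

# Flag the first packet and every packet that follows a long gap,
# then turn the flags into a running burst id with cumsum()
burst_id <- cumsum(c(TRUE, diff(times) > gap_threshold))

# Per-burst time span and per-burst list of frame lengths
spans <- tapply(times, burst_id, function(x) max(x) - min(x))
frame_groups <- split(frame_len, burst_id)

print(burst_id)
print(spans)
print(frame_groups)
```

This relies on the rows being sorted by PacketTime. The resulting burst_id could also be stored as a column and used in group_by() in place of Interval.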

Answer 2:

Here is a base R solution using aggregate() and transform():

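u <- aggregate(
    . ~ PacketTime,
    transform(df,
        # Per-row copy of the time span of the group each row belongs to
        PTime = ave(PacketTime, trunc(PacketTime),
                    FUN = function(x) diff(range(x))),
        # Replace the timestamp with its whole-second group key
        PacketTime = trunc(PacketTime)
    ),
    c
)
# Each group's PTime values are identical, so keep a single one
dfout <- transform(u, PTime = sapply(PTime, unique))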

which gives

> dfout
  PacketTime      FrameLen         IPLen        TCPLen Movement    PTime
1          7 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063683
2         12 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.069183
3         17 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063904
4         22 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.063627
5         27 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.065637
6         32 304, 276, 276 300, 272, 272 260, 232, 232  0, 0, 0 0.064778
7         37      304, 276      300, 272      260, 232     0, 0 0.061367
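The key step in this answer is ave(), which applies a function within each group and returns a vector as long as its input. A minimal sketch of that step in isolation, using the first six PacketTime values from the question's dput() (the variable names times and span are illustrative only):

```r
# First six PacketTime values from the dput() in the question
times <- c(7.083779, 7.147268, 7.147462, 12.084768, 12.153246, 12.153951)

# For every row, compute max - min of the times sharing its whole second
span <- ave(times, trunc(times), FUN = function(x) diff(range(x)))
print(span)
```

Every row in a group receives the same span, which is why the answer collapses PTime with unique() afterwards. Note that the grouping key here is still trunc(PacketTime), so a burst that crosses a whole-second boundary would be split.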

Source: https://stackoverflow.com/questions/61715025/concatenating-data-frame-rows-based-on-column-condition

Tags

dataframe