Separating data into bins and calculating averages

Question

I have this sample data:

Time(s) Bacteria count
0.4 2
0.82    5
6.67    8
7.55    11
8.21    14
8.89    17
9.4 20
10.18   23
10.85   26
11.35   29
11.85   32
12.41   35
13.36   38
13.86   41
14.57   44
15.08   47
15.67   50
16.09   53
16.59   56
18.53   59
24.43   62
25.32   65
25.97   68
26.37   71
26.93   74
27.87   77
28.33   80
29.1    83
29.88   84
30.88   85
31.99   86
35.65   87
36.06   88
36.46   89
36.96   90
37.39   91
37.95   92
38.56   93
39.22   94
39.79   95
40.56   96
41.47   97
42.02   98
42.73   99
43.4    100
43.93   101
44.67   102
45.24   103
45.9    104
46.58   105
47.22   106
47.89   107
48.64   108
49.13   109
49.91   110
50.48   111
51.25   112
53.35   113
53.98   114
54.69   115
55.82   116
56.38   117
56.99   118
62.09   119
63.1    120
63.84   121
64.64   122
65.37   123
66.61   124
69  125
69.72   126
70.78   126
73.32   126
74.65   126
75.12   126
75.45   126
75.94   126
76.38   126
76.84   126
77.95   126
78.61   126
79.06   126
79.62   126
80.19   126
82.73   126
85.3    126
85.68   126
86.42   126
87.41   126
88.08   126
91.74   126
92.81   126
93.21   126
94.32   126
96.32   126
102.03  126
102.71  126
104.45  126
105.04  126
105.65  126
106.16  126
107.44  126
107.9   126
109.72  126
110.24  126
111.24  126
111.84  126
112.45  126
113.12  126
114.02  126
114.67  126
115.24  126
115.85  126
117 126
121.26  126
121.8   126
125.8   126
127.26  126
128.37  126
129.48  126
130.27  126
131.04  126
131.72  126
132.47  126
133.21  126
134.27  126
134.87  126
136.04  126
136.6   126
137.27  126
140.83  126
142.05  126
143.63  126
144.12  126
149.83  126
151.07  126
151.79  126
153.24  126
154.14  126
155.24  126
156.58  126
157.51  126
158.25  126
161.43  126
162.14  126
162.8   126
164.26  126
165.09  126
165.76  126
166.83  126
167.42  126
168.94  126
169.75  126
170.52  126
171.19  126
172.67  126
173.44  126

My data run from Time = 0 s to Time = 2000 s. The program we are using records the number of bacteria in a dish whenever the count changes; when it detects nothing it prints nothing, so those times are simply skipped in the output. I want to use R to separate the data into 30-second intervals and calculate the average number of bacteria spores in each interval. How would I go about doing this?

Answer 1:

I did a bit of modelling, with some assumptions. I've modelled this system as though you start off with 126 bacteria, each of which has some probability of becoming 'active', and by the end of the trial all bacteria are 'active'. I've called your data bacteria, with columns Time and Bacteria_count.

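# Binomial GLM: 'active' vs. not-yet-active bacteria out of 126, as a function of time
bacteria.glm <- glm(cbind(Bacteria_count, 126 - Bacteria_count) ~ Time, 
                    data=bacteria, family=binomial(logit))

# Observed proportion active, with the fitted logistic curve overlaid in red
plot(Bacteria_count/126 ~ Time, data=bacteria)
lines(bacteria$Time, fitted(bacteria.glm), col="red")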

Given this, we can interpolate at 30 second intervals:

bacteria_intervals <- seq(0, 173.44, 30)
bac_predict<-data.frame(Time=bacteria_intervals, 
                        Bacteria_count=predict(bacteria.glm, data.frame(Time=bacteria_intervals), 
                                               type="response")*126)

plot(bacteria)
points(Bacteria_count~Time, data=bac_predict, col="red", pch=16)

bac_predict
##   Time Bacteria_count
## 1    0       12.39587
## 2   30       76.11856
## 3   60      120.36021
## 4   90      125.57925
## 5  120      125.96982
## 6  150      125.99784

Alternatively, for linear interpolation (time 0 lies before the first observation at 0.4 s, so approx returns NA there):

bacteria_linear <- approx(bacteria, xout=seq(0, 173.44, 30))
setNames(as.data.frame(bacteria_linear), c("Time", "Bacteria_count"))
##   Time Bacteria_count
## 1    0             NA
## 2   30        84.1200
## 3   60       118.5902
## 4   90       126.0000
## 5  120       126.0000
## 6  150       126.0000

Or even spline interpolation (note that the spline extrapolates to a negative count at time 0):

bacteria_spline <- spline(bacteria, xout=seq(0, 173.44, 30))
setNames(as.data.frame(bacteria_spline), c("Time", "Bacteria_count"))
##   Time Bacteria_count
## 1    0      -1.672644
## 2   30      84.110483
## 3   60     118.854542
## 4   90     126.000000
## 5  120     126.000000
## 6  150     126.000000
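
The interpolations above give a value at each 30-second mark rather than an average over each interval. For comparison, here is a minimal sketch of a plain per-bin mean using cut and tapply. It is only an illustration under stated assumptions: the data frame holds a small hand-picked subset of the question's rows, and bin_width, breaks, bin and bin_means are names I've made up for the example.

```r
# Small illustrative subset of the question's data (not the full series).
bacteria <- data.frame(
  Time = c(0.4, 9.4, 18.53, 29.88, 30.88, 45.24, 56.99, 62.09, 69.72, 85.3),
  Bacteria_count = c(2, 20, 59, 84, 85, 103, 118, 119, 126, 126)
)

# Bin edges every 30 s, extended so the last observation falls inside a bin.
bin_width <- 30
breaks <- seq(0, ceiling(max(bacteria$Time) / bin_width) * bin_width,
              by = bin_width)

# Label each row with its interval [start, end); include.lowest = TRUE also
# keeps an observation that sits exactly on the final edge.
bacteria$bin <- cut(bacteria$Time, breaks = breaks, right = FALSE,
                    include.lowest = TRUE)

# Mean of the recorded counts within each bin (NA for a bin with no records).
bin_means <- tapply(bacteria$Bacteria_count, bacteria$bin, mean)
print(bin_means)
```

Bear in mind that this averages the recorded rows only, so a bin with many records early on and few later is weighted towards the early counts, and a bin with no records at all comes back as NA.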

Answer 2:

A totally naive approach: at each 30-second boundary, take the average of the last value in one 30-second block and the first value of the next block:

Get the data:

test <- read.table(text="time bacteria 
0.4 2
0.82 5
6.67 8
7.55 11
8.21 14
8.89 17
9.4 20
10.18 23
10.85 26
11.35 29
11.85 32
12.41 35
13.36 38
13.86 41
14.57 44
15.08 47
15.67 50
16.09 53
16.59 56
18.53 59
24.43 62
25.32 65
25.97 68
26.37 71
26.93 74
27.87 77
28.33 80
29.1 83
29.88 84
30.88 85
31.99 86
35.65 87
36.06 88
36.46 89
36.96 90
37.39 91
37.95 92
38.56 93
39.22 94
39.79 95
40.56 96
41.47 97
42.02 98
42.73 99
43.4 100
43.93 101
44.67 102
45.24 103
45.9 104
46.58 105
47.22 106
47.89 107
48.64 108
49.13 109
49.91 110
50.48 111
51.25 112
53.35 113
53.98 114
54.69 115
55.82 116
56.38 117
56.99 118
62.09 119
63.1 120
63.84 121
64.64 122
65.37 123
66.61 124
69 125
69.72 126
70.78 126
73.32 126
74.65 126
75.12 126
75.45 126
75.94 126
76.38 126
76.84 126
77.95 126
78.61 126
79.06 126
79.62 126
80.19 126
82.73 126
85.3 126
85.68 126
86.42 126
87.41 126
88.08 126
91.74 126
92.81 126
93.21 126
94.32 126
96.32 126
102.03 126
102.71 126
104.45 126
105.04 126
105.65 126
106.16 126
107.44 126
107.9 126
109.72 126
110.24 126
111.24 126
111.84 126
112.45 126
113.12 126
114.02 126
114.67 126
115.24 126
115.85 126
117 126
121.26 126
121.8 126
125.8 126
127.26 126
128.37 126
129.48 126
130.27 126
131.04 126
131.72 126
132.47 126
133.21 126
134.27 126
134.87 126
136.04 126
136.6 126
137.27 126
140.83 126
142.05 126
143.63 126
144.12 126
149.83 126
151.07 126
151.79 126
153.24 126
154.14 126
155.24 126
156.58 126
157.51 126
158.25 126
161.43 126
162.14 126
162.8 126
164.26 126
165.09 126
165.76 126
166.83 126
167.42 126
168.94 126
169.75 126
170.52 126
171.19 126
172.67 126
173.44 126",header=TRUE)

Find the blocks and take the averages:

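# Assign each observation to a 30-second block (1 = [0,30), 2 = [30,60), ...)
test$block <- findInterval(test$time,seq(0,max(test$time),30))

# The counts never decrease, so within a block max = last value and min = first value.
# Row 1: last value of each block; row 2: first value of the following block
# (NA for the final block, which has no successor and so keeps its own last value).
apply(
  rbind(
    tapply(test$bacteria,test$block,max),
    c(tail(tapply(test$bacteria,test$block,min),-1),NA)
  ),
  2,
  mean,
  na.rm=TRUE
)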

Result:

    1     2     3     4     5     6 
 84.5 118.5 126.0 126.0 126.0 126.0

The caveat is that this only works when observations are recorded relatively frequently. Big gaps will throw the result way off, and you'd be better off with a more sophisticated solution.
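
One possible way to make gaps matter less, sketched here under assumptions rather than offered as the definitive fix: since the program only prints when the count changes, treat the count as holding its last recorded value until the next record, evaluate that step function on a regular grid, and average the grid values per block. The data below are a small hand-picked subset of the question's rows, and steps_per_sec, idx, grid, filled and tw_means are hypothetical names for this example.

```r
# Small illustrative subset of the question's data (not the full series).
test <- data.frame(
  time = c(0.4, 9.4, 18.53, 29.88, 30.88, 45.24, 56.99, 62.09, 69.72, 85.3),
  bacteria = c(2, 20, 59, 84, 85, 103, 118, 119, 126, 126)
)

block_len <- 30       # block length in seconds
steps_per_sec <- 10   # grid resolution: one point every 0.1 s
n_blocks <- ceiling(max(test$time) / block_len)

# Integer grid index avoids floating-point trouble at the block boundaries.
idx <- 0:(n_blocks * block_len * steps_per_sec - 1)
grid <- idx / steps_per_sec
block <- idx %/% (block_len * steps_per_sec) + 1

# Carry the last recorded count forward; rule = c(1, 2) leaves NA before the
# first record and holds the final count after the last one.
filled <- approx(test$time, test$bacteria, xout = grid,
                 method = "constant", rule = c(1, 2))$y

# Approximate time-weighted mean per 30-second block.
tw_means <- tapply(filled, block, mean, na.rm = TRUE)
print(tw_means)
```

Each grid point carries equal weight, so a count that persists through a long gap contributes in proportion to how long it lasted. Whether carrying the last value forward is appropriate depends on how the recording program actually behaves.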

Source: https://stackoverflow.com/questions/14949819/separating-data-into-bins-and-calculating-averages

Tags

interpolation

average

bins