Find all local maxima of a geom_smooth curve in R ggplot?

徘徊边缘 提交于 2021-02-19 07:39:49

问题


I need to find all local maxima of a geom_smooth() curve in R. This has been asked in Stack Overflow before:

How can I get the peak and valleys of a geom_smooth line in ggplot2?

But the answer related to finding a single maximum. What if there are multiple local maxima we want to find?

Here's some sample data:

library(tidyverse)

set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
                 y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))

df %>% ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_smooth()

To find a single maximum, we use the function underlying geom_smooth() in order to get the y values of the curve. This would be either gam() for 1000+ data points or loess() for fewer than 1000. In this case, it's gam() from library(mgcv). To find our maximum is a simple matter of subsetting with which.max(). We can plot the modeled y values over geom_smooth() to confirm they're the same, with our maximum represented by a vertical line:

library(mgcv)

df <- df %>% 
  mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))

maximum <- df$x[which.max(df$smooth_y)]

df %>% ggplot() +
  geom_point(aes(x=x,y=y)) +
  geom_smooth(aes(x=x,y=y)) +
  geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red")  +
  geom_vline(xintercept = maximum,color="green")

So far, so good. But, there is more than one maximum here. Maybe we're trying to find the periodicity of the sine wave, measured as the average distance between maxima. How do we make sure we find all maxima in the series?

I am posting my answer below, but I am wondering if there's a more elegant solution than the brute-force method I used.


回答1:


You can find the points where the difference between subsequent points flips sign using run-length encoding. Note that this method is approximate and relies on x being ordered. You can refine the locations by predicting more closely spaced x-values.

library(tidyverse)
library(mgcv)

set.seed(404)
df <- data.frame(x = seq(0,4*pi,length.out=1000),
                 y = sin(seq(0,4*pi,length.out=1000))+rnorm(100,0,1))

df <- df %>% 
  mutate(smooth_y = predict(gam(y ~ s(x,bs="cs"),data=df)))

# Run length encode the sign of difference
rle <- rle(diff(as.vector(df$smooth_y)) > 0)
# Calculate startpoints of runs
starts <- cumsum(rle$lengths) - rle$lengths + 1
# Take the points where the rle is FALSE (so difference goes from positive to negative) 
maxima_id <- starts[!rle$values]

# Also convenient, but not in the question:
# minima_id <- starts[rle$values]


maximum <- df$x[maxima_id]

df %>% ggplot() +
  geom_point(aes(x=x,y=y)) +
  geom_smooth(aes(x=x,y=y)) +
  geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red")  +
  geom_vline(xintercept = maximum,color="green")
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Created on 2020-12-24 by the reprex package (v0.3.0)




回答2:


I went with a brute force, Monte Carlo method to solve the problem. Using replicate(), we try out 100 random ranges of x and find the maximum y value within each range. We reject maxima that occur at either end of the range. Then we find all unique values of the output vector:

maxima <- replicate(100,{
  x_range <- sample(df$x,size=2,replace=FALSE) %>% sort()
  max_loc <- df %>%
    filter(x >= x_range[1] & x <= x_range[2]) %>%
    filter(smooth_y == max(smooth_y)) %>%
    pull(x)
  if(max_loc == min(x_range)|max_loc == max(x_range)){NA}else{max_loc}
})
unique_maxima <- unique(maxima[!is.na(maxima)])

df %>% ggplot() +
  geom_point(aes(x=x,y=y)) +
  geom_smooth(aes(x=x,y=y)) +
  geom_line(aes(x=x,y=smooth_y),size = 1.5, linetype = 2, col = "red")  +
  geom_vline(xintercept = unique_maxima,color="green")



来源:https://stackoverflow.com/questions/65442224/find-all-local-maxima-of-a-geom-smooth-curve-in-r-ggplot

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!