Geo distance calculation using SparkR

做~自己de王妃 posted on 2019-12-04 19:17:07

You cannot use standard R functions directly on Spark DataFrames. If you use a recent Spark release (dapply was added in 2.0), you can use dapply, but it is a bit verbose and slow:

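library(SparkR)
library(magrittr)  # provides %>%

# Assumes no session exists yet; not needed inside the sparkR shell.
sparkR.session()

df <- createDataFrame(data.frame(
  lat1 = c(23.123), lng1 = c(24.234), lat2 = c(25.345), lng2 = c(26.456)))

# Output schema: the input fields plus a new nullable double column "dist".
new_schema <- do.call(
  structType, c(schema(df)$fields(), list(structField("dist", "double", TRUE))))

# Called once per partition with a plain R data.frame.
# The geosphere package must be installed on every worker.
attach_dist <- function(df) {
  df$dist <- geosphere::distCosine(
    cbind(df$lng1, df$lat1), cbind(df$lng2, df$lat2))
  df
}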

dapply(df, attach_dist, new_schema) %>% head()
    lat1   lng1   lat2   lng2     dist
1 23.123 24.234 25.345 26.456 334733.4

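In practice, I would rather use the formula directly. It will be much faster because it runs as native Spark column expressions instead of shipping data to R workers, all the required functions are already available in SparkR, and it is not very complicated: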

df %>% withColumn("dist", acos(
  sin(toRadians(df$lat1)) * sin(toRadians(df$lat2)) + 
  cos(toRadians(df$lat1)) * cos(toRadians(df$lat2)) * 
  cos(toRadians(df$lng1) - toRadians(df$lng2))
) * 6378137) %>% head()
    lat1   lng1   lat2   lng2     dist
1 23.123 24.234 25.345 26.456 334733.4
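
For reference, the expression above is the spherical law of cosines, which is also the formula geosphere::distCosine implements. Here is a sketch of it in standard notation. The symbols are my own labels, not names from the code: φ1, φ2 are the two latitudes and λ1, λ2 the two longitudes, all in radians, and R is the sphere radius of 6378137 m hard-coded above (also the default radius in geosphere::distCosine).

```latex
d = R \,\arccos\bigl( \sin\varphi_1 \sin\varphi_2
      + \cos\varphi_1 \cos\varphi_2 \cos(\lambda_1 - \lambda_2) \bigr),
\qquad R = 6378137\ \text{m}
```

toRadians does the degree-to-radian conversion for each column, and the result is in meters because R is given in meters.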