find state name from lat-long in pyspark dataframe

Submitted by 我们两清 on 2020-07-23 06:06:07

Question


I have a PySpark data frame df holding a large number of rows. One of the columns is lat_long. I want to find the state name from the lat-long. I am using the code below:

import pandas as pd
import reverse_geocoder as rg

new_df = df_new2.toPandas()
list_long_lat = new_df["lat_long"].tolist()  # list of (lat, lon) tuples
result = rg.search(list_long_lat)
state_name = []
for each_entry in result:
    state_name.append(each_entry["admin2"])
state_values = pd.Series(state_name)
new_df.insert(loc=0, column='State_name', value=state_values)

First of all, when converting to pandas I am getting an out-of-memory error. Is there a way to efficiently find the state name without converting the PySpark data frame to a pandas data frame at all, considering that the input data frame is huge (on the order of a million rows or more)?


Answer 1:


Can you try creating a UDF:

import reverse_geocoder as rg
import pyspark.sql.functions as f

# rg.search expects a (lat, lon) tuple (or a list of them)
map_state = f.udf(lambda x: rg.search(x)[0]['admin2'])
data.withColumn('State', map_state(f.col('lat_long'))).show()

The only drawback here is that Python UDFs are not very fast, and this invokes the geocoder lookup once per row.
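One way around the per-row lookup, sketched below with a hedged, hypothetical helper: collect each Spark partition's coordinates and geocode them in a single batched call via mapPartitions. `states_for_partition` and its row layout are assumptions for illustration, not part of the original answers; `search_fn` stands in for reverse_geocoder.search so the batching logic itself can be tested without the library.

```python
def states_for_partition(rows, search_fn):
    """Collect all lat/long pairs in one partition, geocode them in a
    single batched call, and yield each row with the state name added.

    rows      -- iterable of dicts, each with a 'lat_long' (lat, lon) tuple
    search_fn -- batched geocoder, e.g. reverse_geocoder.search
    """
    rows = list(rows)          # materialise one partition (not the whole df)
    if not rows:
        return
    coords = [r["lat_long"] for r in rows]
    results = search_fn(coords)            # one geocoder call per partition
    for row, res in zip(rows, results):
        yield {**row, "State_name": res["admin2"]}

# Wiring it into Spark (untested sketch; assumes df has a 'lat_long'
# column of (lat, lon) tuples):
#
# import reverse_geocoder as rg
# states_rdd = df.rdd.mapPartitions(
#     lambda part: states_for_partition((r.asDict() for r in part), rg.search)
# )
```

This keeps memory bounded to one partition at a time instead of one toPandas of the whole data frame, and amortises the geocoder's setup cost over thousands of rows per call.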




Answer 2:


I haven't done much PySpark, but its syntax is somewhat similar to pandas'. Maybe give the following snippet a try.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import reverse_geocoder as rg

# extract the string field, since rg.search returns a list of dicts
search_state_udf = udf(lambda x: rg.search(x)[0]["admin2"], StringType())

df.withColumn("state", search_state_udf(df.lat_long))

When the dataset has more than 1M records, looping over the whole dataset row by row is often not performant; you may want to have a look at apply (or batched/vectorized alternatives) to make it efficient.
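If you do end up on the pandas side, a simple way to keep memory and call overhead in check is to geocode in fixed-size chunks rather than per row. The helper below is a hypothetical sketch (its name and chunk size are assumptions, not from the original answer); `search_fn` again stands in for reverse_geocoder.search.

```python
def geocode_in_chunks(coords, search_fn, chunk_size=50_000):
    """Geocode a list of (lat, lon) tuples in fixed-size chunks.

    Each chunk triggers one batched geocoder call, so the number of
    calls is len(coords) / chunk_size instead of len(coords).
    Returns the 'admin2' field for every coordinate, in order.
    """
    names = []
    for start in range(0, len(coords), chunk_size):
        batch = coords[start:start + chunk_size]
        names.extend(entry["admin2"] for entry in search_fn(batch))
    return names

# Usage sketch (assumes reverse_geocoder is installed):
# import reverse_geocoder as rg
# pdf["State_name"] = geocode_in_chunks(pdf["lat_long"].tolist(), rg.search)
```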



Source: https://stackoverflow.com/questions/62675510/find-state-name-from-lat-long-in-pyspark-dataframe
