Question
I have a PySpark data frame df which holds a large number of rows. One of the columns is lat_long.
I want to find the state name from the lat-long. I am using the code below:
import pandas as pd
import reverse_geocoder as rg

new_df = df_new2.toPandas()                      # convert the Spark frame to pandas
list_long_lat = new_df["lat_long"].tolist()      # list of (lat, lon) coordinates
result = rg.search(list_long_lat)                # one offline lookup for the whole list

state_name = []
for each_entry in result:
    state_name.append(each_entry["admin2"])

state_values = pd.Series(state_name)
new_df.insert(loc=0, column='State_name', value=state_values)
First of all, when converting to pandas I am getting an out-of-memory issue. Is there any way to efficiently find the state name without even converting from a PySpark data frame to a pandas data frame, considering that the number of rows in the input data frame is huge (around 1 million)?
Answer 1:
Can you try creating a UDF?
import reverse_geocoder as rg
import pyspark.sql.functions as f

# data is the Spark DataFrame; each lat_long value is assumed to be a coordinate pair that rg.search accepts
map_state = f.udf(lambda x: rg.search(x)[0]['admin2'])
data.withColumn('State', map_state(f.col('lat_long'))).show()
The only drawback here is that UDFs are not very fast, and this performs the lookup once per row rather than in one batch.
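If the per-row calls turn out to be too slow, the same lookup can be batched per partition with mapInPandas (Spark 3.0+), which calls rg.search once per Arrow batch instead of once per row. This is only a minimal sketch, assuming each lat_long value is a "lat,lon" string; adjust the parsing and the output schema to the real column type and to the other columns of df:

import reverse_geocoder as rg

def add_state(batches):
    # batches is an iterator of pandas DataFrames, one per Arrow batch
    for pdf in batches:
        coords = [tuple(map(float, s.split(","))) for s in pdf["lat_long"]]  # assumes "lat,lon" strings
        results = rg.search(coords)                                          # one offline lookup per batch
        pdf["State"] = [r["admin2"] for r in results]
        yield pdf

df.select("lat_long").mapInPandas(add_state, schema="lat_long string, State string").show()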
Answer 2:
I haven't done much PySpark, but PySpark's syntax is somewhat similar to pandas. Maybe give the following snippet a try:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
search_state_udf = udf(lambda x: rg.search(x)[0]['admin2'], StringType())  # keep only the 'admin2' string
df.withColumn("state", search_state_udf(df.lat_long))
When the dataset has more than 1M records, looping over the whole dataset row by row is often not performant; you may want to have a look at apply to make it more efficient (see the sketch below).
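If you do stay in pandas, the apply mentioned above might look like the sketch below, assuming each lat_long value is a (lat, lon) pair and the data fits in driver memory. In practice, a single batched rg.search over the whole list, as in the question, is usually faster still than a per-row apply.

import reverse_geocoder as rg

pdf = df.toPandas()  # only viable when the data fits in driver memory
# per-row lookup via apply; each lat_long value is assumed to be a (lat, lon) pair
pdf["state"] = pdf["lat_long"].apply(lambda x: rg.search(tuple(x))[0]["admin2"])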
Source: https://stackoverflow.com/questions/62675510/find-state-name-from-lat-long-in-pyspark-dataframe