Geoip2's python library doesn't work in pySpark's map function

二次信任 提交于 2019-11-30 20:11:39

It looks like the problem with your code comes from a reader object. It cannot be correctly serialized as a part of a closure and send to the workers.To deal with this you have instantiate it on the workers. One way you can handle this is to use mapPartitions:

from pyspark import SparkFiles

geoDBpath = 'GeoLite2-City.mmdb'
sc.addFile(geoDBpath)

def partitionIp2city(iter):
    from geoip2 import database

    def ip2city(ip):
        try:
           city = reader.city(ip).city.name
        except:
            city = 'not found'
        return city

    reader = database.Reader(SparkFiles.get(geoDBpath))
    return [ip2city(ip) for ip in iter]

rdd = sc.parallelize(['128.101.101.101', '85.25.43.84'])
rdd.mapPartitions(partitionIp2city).collect()

## ['Minneapolis', None]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!