Resolving IPs with GeoIP in Spark Streaming

Submitted by 馋奶兔 on 2020-03-03 09:41:54

1. First, copy the GeoIP database to the server, e.g. /opt/db/geo/GeoLite2-City.mmdb
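Before wiring the database into any code, it is worth verifying that the file actually landed where step 1 expects it. A minimal sketch (the path comes from the article; the `check_geodb` helper name is my own):

```shell
# Hypothetical helper: succeeds only when the file exists and is non-empty
check_geodb() {
  [ -s "$1" ]
}

if check_geodb /opt/db/geo/GeoLite2-City.mmdb; then
  echo "GeoIP database in place"
else
  echo "GeoIP database missing or empty" >&2
fi
```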

2. Create a Scala sbt project and verify that the database resolves IPs correctly

import java.io.File
import java.net.InetAddress
import com.maxmind.db.CHMCache
import com.maxmind.geoip2.DatabaseReader
import org.json4s.DefaultFormats

/**
 * Created by zxh on 2016/7/17.
 */
object test {
  implicit val formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val url = "F:\\Code\\OpenSource\\Data\\spark-sbt\\src\\main\\resources\\GeoLite2-City.mmdb"
    // val url2 = "/opt/db/geo/GeoLite2-City.mmdb"
    val geoDB = new File(url)
    require(geoDB.exists(), s"GeoIP database not found: $url")
    // CHMCache caches decoded nodes so repeated lookups stay fast
    val geoIPResolver = new DatabaseReader.Builder(geoDB).withCache(new CHMCache()).build()
    val ip = "222.173.17.203"
    val inetAddress = InetAddress.getByName(ip)
    val geoResponse = geoIPResolver.city(inetAddress)
    val (country, province, city) = (
      geoResponse.getCountry.getNames.get("zh-CN"),
      geoResponse.getSubdivisions.get(0).getNames.get("zh-CN"),
      geoResponse.getCity.getNames.get("zh-CN"))
    println(s"country:$country,province:$province,city:$city")
  }
}
The build.sbt contents are as follows:
import AssemblyKeys._
assemblySettings
mergeStrategy in assembly <<= (mergeStrategy in assembly) { mergeStrategy =>
{
  case entry => {
    val strategy = mergeStrategy(entry)
    if (strategy == MergeStrategy.deduplicate) MergeStrategy.first
    else strategy
  }
}
}
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
name := "scala_sbt"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "com.maxmind.geoip2" % "geoip2" % "2.5.0"

Package the program, copy it to the server, and run `scala -cp ./scala_sbt-assembly-1.0.jar test`. The output is:

country:中国,province:山东省,city:济南
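One caveat before feeding arbitrary stream data into the resolver: GeoLite2 has no entries for private, loopback, or otherwise non-routable addresses, and `DatabaseReader.city` throws for those. A hedged pre-filter using only the JDK (the `IpFilter` object and `isPubliclyRoutable` helper are my own names, not part of the original code):

```scala
import java.net.InetAddress
import scala.util.Try

object IpFilter {
  // True only for addresses GeoLite2 could plausibly contain:
  // parseable, and not loopback / private / link-local / wildcard / multicast.
  // Note: getByName performs a DNS lookup for non-literal strings, so pass
  // only IP literals on a hot path.
  def isPubliclyRoutable(s: String): Boolean =
    Try(InetAddress.getByName(s)).toOption.exists { a =>
      !(a.isLoopbackAddress || a.isSiteLocalAddress ||
        a.isLinkLocalAddress || a.isAnyLocalAddress || a.isMulticastAddress)
    }

  def main(args: Array[String]): Unit = {
    Seq("222.173.17.203", "10.0.0.1", "127.0.0.1")
      .foreach(ip => println(s"$ip -> ${isPubliclyRoutable(ip)}"))
  }
}
```

Dropping such addresses before the lookup avoids wrapping every `city` call in a try/catch inside the partition loop.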

3. Write the Spark Streaming program

import java.io.File
import java.net.InetAddress
import com.maxmind.db.CHMCache
import com.maxmind.geoip2.DatabaseReader
import com.maxmind.geoip2.model.CityResponse
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Time, Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by zxh on 2016/7/17.
 */
object geoip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("geoip_test").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD((rdd: RDD[String], t: Time) => {
      rdd.foreachPartition(p => {
        // Build the reader here, inside foreachPartition: it is loaded once
        // per partition and never crosses the serialization boundary
        val url2 = "/opt/db/geo/GeoLite2-City.mmdb"
        val geoDB = new File(url2)
        val geoIPResolver = new DatabaseReader.Builder(geoDB).withCache(new CHMCache()).build()
        def resolve_ip(resp: CityResponse): (String, String, String) = {
          (resp.getCountry.getNames.get("zh-CN"),
           resp.getSubdivisions.get(0).getNames.get("zh-CN"),
           resp.getCity.getNames.get("zh-CN"))
        }
        p.foreach(x => {
          if (x != null && x.nonEmpty) {
            val inetAddress = InetAddress.getByName(x)
            val geoResponse = geoIPResolver.city(inetAddress)
            println(resolve_ip(geoResponse))
          }
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}

Add the dependency to build.sbt:

libraryDependencies += "com.maxmind.geoip2" % "geoip2" % "2.5.0"

Note: the highlighted part (the DatabaseReader construction and resolve_ip) must live inside foreachPartition, for the following reasons:

1. It reduces how often the database file is loaded: each partition loads it only once, rather than once per record.

2. resolve_ip takes a CityResponse argument, which is not serializable; keeping it inside the partition closure means it is never serialized and shipped between nodes.

3. com.maxmind.geoip2 must be version 2.5.0 to stay compatible with Spark itself; otherwise the following error is thrown:

val geoIPResolver = new DatabaseReader.Builder(geoDB).withCache(new CHMCache()).build();
java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.node.ArrayNode.<init>(Lcom/fasterxml/jackson/databind/node/JsonNodeFactory;Ljava/util/List;)V
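Point 2 above can be demonstrated without Spark at all: Spark ships closures using Java serialization, which rejects any captured object whose class does not implement java.io.Serializable. The sketch below (the `SerDemo` object and `GeoHandle` class are my own stand-ins for DatabaseReader/CityResponse) shows exactly that check failing:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerDemo {
  // Stand-in for DatabaseReader / CityResponse: no Serializable marker
  class GeoHandle(val path: String)

  // Attempt Java serialization, the same mechanism Spark uses for closures
  def canSerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize("a plain string"))         // String implements Serializable
    println(canSerialize(new GeoHandle("/opt/db"))) // GeoHandle does not
  }
}
```

Capturing such a handle outside foreachPartition would make Spark throw a Task-not-serializable error at job submission; constructing it inside the partition sidesteps serialization entirely.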
