Filtering and Processing Log Files with Waterdrop

Submitted by 眉间皱痕 on 2020-02-26 02:25:54

Use Waterdrop to filter and process a log file, then load the data into the database.

  • Install Waterdrop

    • Download the Waterdrop package with wget
      wget xxxxx
      
    • Unzip it into a directory of your choice
      unzip xxx(package path) -d xxx(destination directory)
      
      If unzip errors out here, install the unzip command yourself first.
    • Set the dependency environment (Java and Spark) in the config directory
      vim ./config/waterdrop-env.sh
      
      
      #!/usr/bin/env bash
      # Home directory of spark distribution.
      SPARK_HOME=/extends/soft/spark-2.4.4
      
      # Home directory of the JDK.
      JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
      
      # Home directory of the Hadoop distribution.
      HADOOP_HOME=/extends/soft/hadoop-2.7.4
      
    • Go into the config directory, copy one of the existing examples, and modify it (the sketch below pulls these install steps together).
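      A consolidated sketch of the install steps above (the download URL, version, and install path are placeholders, not real values):
      
      # Download and unpack Waterdrop (URL/version are placeholders)
      wget <waterdrop-release-zip-url>
      unzip waterdrop-<version>.zip -d /extends/soft/
      cd /extends/soft/waterdrop-<version>
      # Point the env file at your local Java/Spark/Hadoop installations
      vim ./config/waterdrop-env.sh
      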
  • Build the config file that processes the data

    • The job here reads a log file, filters out the valid records, and stores them.

    • The config file is as follows:

    ######
    ###### This config file is a demonstration of batch processing in waterdrop config
    ######
    
    spark {
    # You can set spark configuration here
    # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
    spark.app.name = "Waterdrop"
    spark.executor.instances = 2
    spark.executor.cores = 1
    spark.executor.memory = "1g"
    }
    
    input {
    # This is an example input plugin, **only for testing and demonstrating the input plugin feature**
    #  fake {
    #    result_table_name = "my_dataset"
    #  }
    
    file {
        path = "file:///home/logs/gps.log"
        result_table_name = "gps"
        format = "text"
    }
    
    # You can also use other input plugins, such as hdfs
    # hdfs {
    #   result_table_name = "accesslog"
    #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
    #   format = "json"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    filter {
    # split data by specific delimiter
    #  split {
    #    fields = ["msg", "name"]
    #    delimiter = " "
    #    result_table_name = "accesslog"
    #  }
    
    
    sql {
        sql = "select * from gps where raw_message like '%接收到的数据为:%'"
    }
    
    split {
        source_field = "raw_message"
        delimiter = "接收到的数据为:"
        fields = ["field1", "field2"]
    }
    
    json {
        source_field = "field2"
        result_table_name = "gps_test"
    }
    
    sql {
        sql = "select concat('',encrypt) as encrypt,`date` as up_date,concat('',lon) as lon,concat('',lat) as lat,concat('',vec1) as vec1,concat('',vec2) as vec2,concat('',vec3) as vec3,concat('',direction) as direction,concat('',altitude) as altitude,concat('',state) as state,concat('',alarm) as alarm,concat('',vehicleno) as vehicleno,concat('',vehiclecolor) as vehiclecolor,id,createBy as create_by,createDt as create_dt from gps_test where LENGTH(date) = 19"
    }
    
    # you can also use other filter plugins, such as sql
    # sql {
    #   sql = "select * from accesslog where request_time > 1000"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    output {
    # choose stdout output plugin to output data to console
    #  stdout {
    #  }
    
    clickhouse {
        host = "hadoop-4:8123"
        clickhouse.socket_timeout = 50000
        database = "wlpt_01"
        table = "t_plt_vehicle_location_test"
        fields = ["id","encrypt","up_date","lon","create_by","create_dt","lat","vec1","vec2","vec3","direction","altitude","state","alarm","vehicleno","vehiclecolor"]
        username = "default"
        password = "********"
        bulk_size = 5
        retry = 3
    }
    
    
    
    # you can also use other output plugins, such as hdfs
    # hdfs {
    #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
    #   save_mode = "append"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    

    • Approach
      • Log data format
      2020-01-03 13:36:23,967 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,992 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=52559, crcCode=14014, msgId=4608, msgSn=61937, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[0, 0, 1]]
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,995 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b0000007306ce969412000000045701020f0000000000c9c2483132353538000000000000000000000000000212010000003d3530373334000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030313339393134363431383920d35d)
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=115, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=8403, msgId=4608, msgSn=114202260, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=89, cap=89), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) - (HEXDUMP: 5b0000005a0206ce969512000000045701020f0000000000c9c2483132353538000000000000000000000000000212020000002400030107e40d24100690c4100208318f00000000000168e300b500030000100300000000a78f5d)
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=42895, msgId=4608, msgSn=114202261, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
      2020-01-03 13:36:24,006 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b00000073016a52f61200000004570100000000000000c9c2444235323136000000000000000000000000000212010000003d353039383900000000000000000000000000000000000000000000000000000000000000000000000000000000000000003031343239383831343430300b845d)
      
      • The data format I need here is exactly this: {"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
      • I need to grab this JSON, parse it, and put each field, one by one, into the corresponding column of the database table.
    • My first idea was regex filtering: extract the JSON after 接收到的数据为, then parse the JSON and load it into the database.
    • When I tried the regex with select * from gps where raw_message rlike '{.*}$', it threw an error and went nowhere, so I asked the author, Ricky Huo, who suggested using SQL's like instead.
    • With the regex route blocked and unable to fetch the data properly, I tried like. Note that its usage is similar to like in MySQL: match on raw_message so that every line containing 接收到的数据为: is kept (a grep sanity check follows the snippet below).
       sql {
              sql = "select * from gps  where raw_message like '%接收到的数据为:%' "
          }
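       Before relying on this in the job, you can sanity-check the pattern with grep against the input path from the config (an illustrative shell check, not part of Waterdrop):
       
       # Preview a few of the lines the like filter should keep
       grep '接收到的数据为:' /home/logs/gps.log | head -n 3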
      
    • The test succeeded; the data filtered out here is: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
    • Next, use split to cut at the delimiter 接收到的数据为:, which ends up dividing the data into three parts, as below, with | as the separator: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] | 接收到的数据为:|{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
       split {
              source_field = "raw_message"
              delimiter = "接收到的数据为:"
              fields = ["field1", "field2"]
          }
      
    • With that, the JSON has been split out. The split step above already defined the corresponding fields for its results (see the config file earlier). The next step is to parse this JSON, using the json plugin:
       json {
          source_field = "field2"
          result_table_name = "gps_test"
       }
      
    • The JSON is parsed successfully here and yields the result, but we need a new table to hold the processed output, which is why I renamed the result table via result_table_name above.
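    • One step the walkthrough skips is the final sql filter from the config. My reading of it, assuming Spark SQL semantics (abridged here; the full statement is in the config above):
      
       sql {
           # concat('', x) coerces the numeric JSON fields (lon, lat, vec1, ...) to strings,
           # so they match the String columns of the ClickHouse table; LENGTH(date) = 19
           # keeps only rows whose date is a complete "yyyy-MM-dd HH:mm:ss" timestamp,
           # dropping truncated or malformed records.
           sql = "select concat('',encrypt) as encrypt, `date` as up_date, ... from gps_test where LENGTH(date) = 19"
       }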
  • Run it and check what was stored

    • Finally, write the output section to perform the write, then check the ClickHouse table to confirm the data landed (a submit-and-verify sketch follows).
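      A minimal submit-and-verify sketch (Waterdrop 1.x launcher; the master setting, config file name, and the verification query are illustrative, not from the original post):
      
      # Submit the batch job with the config file built above
      ./bin/start-waterdrop.sh --master local[2] --deploy-mode client --config ./config/gps.conf
      
      # Spot-check that rows landed in the target table
      clickhouse-client --host hadoop-4 --query "SELECT count() FROM wlpt_01.t_plt_vehicle_location_test"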

When first debugging, use the stdout output so the result of every step shows up in the logs, which makes troubleshooting much easier.

output {
  # choose stdout output plugin to output data to console
  stdout {
  }
}

Reference: the Waterdrop documentation, https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base

Thanks to the author of Waterdrop for the guidance and help along the way. Thank you, Ricky Huo.
