Filtering and Processing Log Files with Waterdrop

Submitted by 眉间皱痕 on 2020-02-26 02:25:54

Use Waterdrop to filter and process a log file, then load the data into the database.

  • Install Waterdrop

    • Download the Waterdrop package with wget
      wget xxxxx
      
    • Unzip it into a directory of your choice
      unzip xxx(package path) -d xxx(destination directory)
      
      If unzip errors out here, install the unzip command yourself first.
    • Set the dependency environment (Java and Spark) in the config directory
      vim ./config/waterdrop-env.sh
      
      
      #!/usr/bin/env bash
      # Home directory of spark distribution.
      SPARK_HOME=/extends/soft/spark-2.4.4
      
      # Home directory of the JDK.
      JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
      
      # Home directory of the Hadoop distribution.
      HADOOP_HOME=/extends/soft/hadoop-2.7.4
      
    • Go into the config directory, copy one of the existing examples, and modify it (the sketch below pulls these install steps together).
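      A consolidated sketch of the install steps above (the download URL, version, and install path are placeholders, not real values):
      
      # Download and unpack Waterdrop (URL/version are placeholders)
      wget <waterdrop-release-zip-url>
      unzip waterdrop-<version>.zip -d /extends/soft/
      cd /extends/soft/waterdrop-<version>
      # Point the env file at your local Java/Spark/Hadoop installations
      vim ./config/waterdrop-env.sh
      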
  • Build the config file that processes the data

    • The job here reads a log file, filters out the valid records, and stores them.

    • The config file is as follows:

    ######
    ###### This config file is a demonstration of batch processing in waterdrop config
    ######
    
    spark {
    # You can set spark configuration here
    # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
    spark.app.name = "Waterdrop"
    spark.executor.instances = 2
    spark.executor.cores = 1
    spark.executor.memory = "1g"
    }
    
    input {
    # This is an example input plugin, **only for testing and demonstrating the input plugin feature**
    #  fake {
    #    result_table_name = "my_dataset"
    #  }
    
    file {
        path = "file:///home/logs/gps.log"
        result_table_name = "gps"
        format = "text"
    }
    
    # You can also use other input plugins, such as hdfs
    # hdfs {
    #   result_table_name = "accesslog"
    #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
    #   format = "json"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    filter {
    # split data by specific delimiter
    #  split {
    #    fields = ["msg", "name"]
    #    delimiter = " "
    #    result_table_name = "accesslog"
    #  }
    
    
    sql {
        sql = "select * from gps where raw_message like '%接收到的数据为:%'"
    }
    
    split {
        source_field = "raw_message"
        delimiter = "接收到的数据为:"
        fields = ["field1", "field2"]
    }
    
    json {
        source_field = "field2"
        result_table_name = "gps_test"
    }
    
    sql {
        sql = "select concat('',encrypt) as encrypt,`date` as up_date,concat('',lon) as lon,concat('',lat) as lat,concat('',vec1) as vec1,concat('',vec2) as vec2,concat('',vec3) as vec3,concat('',direction) as direction,concat('',altitude) as altitude,concat('',state) as state,concat('',alarm) as alarm,concat('',vehicleno) as vehicleno,concat('',vehiclecolor) as vehiclecolor,id,createBy as create_by,createDt as create_dt from gps_test where LENGTH(date) = 19"
    }
    
    # you can also use other filter plugins, such as sql
    # sql {
    #   sql = "select * from accesslog where request_time > 1000"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    
    output {
    # choose stdout output plugin to output data to console
    #  stdout {
    #  }
    
    clickhouse {
        host = "hadoop-4:8123"
        clickhouse.socket_timeout = 50000
        database = "wlpt_01"
        table = "t_plt_vehicle_location_test"
        fields = ["id","encrypt","up_date","lon","create_by","create_dt","lat","vec1","vec2","vec3","direction","altitude","state","alarm","vehicleno","vehiclecolor"]
        username = "default"
        password = "********"
        bulk_size = 5
        retry = 3
    }
    
    
    
    # you can also use other output plugins, such as hdfs
    # hdfs {
    #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
    #   save_mode = "append"
    # }
    
    # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
    # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
    }
    

    • Approach
      • Log data format
      2020-01-03 13:36:23,967 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,992 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=52559, crcCode=14014, msgId=4608, msgSn=61937, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[0, 0, 1]]
      2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,995 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b0000007306ce969412000000045701020f0000000000c9c2483132353538000000000000000000000000000212010000003d3530373334000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030313339393134363431383920d35d)
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=115, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=8403, msgId=4608, msgSn=114202260, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=89, cap=89), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) - (HEXDUMP: 5b0000005a0206ce969512000000045701020f0000000000c9c2483132353538000000000000000000000000000212020000002400030107e40d24100690c4100208318f00000000000168e300b500030000100300000000a78f5d)
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=42895, msgId=4608, msgSn=114202261, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[1, 2, 15]]
      2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
      2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
      2020-01-03 13:36:24,006 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b00000073016a52f61200000004570100000000000000c9c2444235323136000000000000000000000000000212010000003d353039383900000000000000000000000000000000000000000000000000000000000000000000000000000000000000003031343239383831343430300b845d)
      
      • The data format I need here is exactly this: {"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
      • I need to grab this JSON, parse it, and put each field, one by one, into the corresponding column of the database table.
    • My first idea was regex filtering: extract the JSON after 接收到的数据为, then parse the JSON and load it into the database.
    • When I tried the regex with select * from gps where raw_message rlike '{.*}$', it threw an error and went nowhere, so I asked the author, Ricky Huo, who suggested using SQL's like instead.
    • With the regex route blocked and unable to fetch the data properly, I tried like. Note that its usage is similar to like in MySQL: match on raw_message so that every line containing 接收到的数据为: is kept (a grep sanity check follows the snippet below).
       sql {
              sql = "select * from gps  where raw_message like '%接收到的数据为:%' "
          }
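       Before relying on this in the job, you can sanity-check the pattern with grep against the input path from the config (an illustrative shell check, not part of Waterdrop):
       
       # Preview a few of the lines the like filter should keep
       grep '接收到的数据为:' /home/logs/gps.log | head -n 3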
      
    • The test succeeded; the data filtered out here is: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
    • Next, use split to cut at the delimiter 接收到的数据为:, which ends up dividing the data into three parts, as below, with | as the separator: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] | 接收到的数据为:|{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
       split {
              source_field = "raw_message"
              delimiter = "接收到的数据为:"
              fields = ["field1", "field2"]
          }
      
    • With that, the JSON has been split out. The split step above already defined the corresponding fields for its results (see the config file earlier). The next step is to parse this JSON, using the json plugin:
       json {
          source_field = "field2"
          result_table_name = "gps_test"
       }
      
    • The JSON is parsed successfully here and yields the result, but we need a new table to hold the processed output, which is why I renamed the result table via result_table_name above.
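    • One step the walkthrough skips is the final sql filter from the config. My reading of it, assuming Spark SQL semantics (abridged here; the full statement is in the config above):
      
       sql {
           # concat('', x) coerces the numeric JSON fields (lon, lat, vec1, ...) to strings,
           # so they match the String columns of the ClickHouse table; LENGTH(date) = 19
           # keeps only rows whose date is a complete "yyyy-MM-dd HH:mm:ss" timestamp,
           # dropping truncated or malformed records.
           sql = "select concat('',encrypt) as encrypt, `date` as up_date, ... from gps_test where LENGTH(date) = 19"
       }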
  • Run it and check what was stored

    • Finally, write the output section to perform the write, then check the ClickHouse table to confirm the data landed (a submit-and-verify sketch follows).
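      A minimal submit-and-verify sketch (Waterdrop 1.x launcher; the master setting, config file name, and the verification query are illustrative, not from the original post):
      
      # Submit the batch job with the config file built above
      ./bin/start-waterdrop.sh --master local[2] --deploy-mode client --config ./config/gps.conf
      
      # Spot-check that rows landed in the target table
      clickhouse-client --host hadoop-4 --query "SELECT count() FROM wlpt_01.t_plt_vehicle_location_test"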

When first debugging, use the stdout output so the result of every step shows up in the logs, which makes troubleshooting much easier.

output {
  # choose stdout output plugin to output data to console
  stdout {
  }
}

Reference: the Waterdrop documentation, https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base

Thanks to the author of Waterdrop for the guidance and help along the way. Thank you, Ricky Huo.
