Filtering and processing log files with Waterdrop and loading the data into a database
- Install Waterdrop
- Download the Waterdrop installation package with wget
wget xxxxx
- Unzip it to the directory you need.
If unzip reports an error here, install the command yourself first, then run: unzip xxx (package path) -d xxx (extraction path)
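Put together, the download-and-unzip steps look roughly like the sketch below. The release URL, version number, and target directory are hypothetical placeholders (the original post elides the real URL); check the Waterdrop releases page for the actual package name.

```
# Hypothetical URL/version/paths -- substitute your own.
wget https://github.com/InterestingLab/waterdrop/releases/download/v1.5.1/waterdrop-1.5.1.zip
# install unzip first if it is missing, e.g. yum install -y unzip
unzip waterdrop-1.5.1.zip -d /extends/soft/
cd /extends/soft/waterdrop-1.5.1
```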
- Set up the dependency environment (Java, Spark, Hadoop) in the package's config directory: vim ./waterdrop-env.sh

```
#!/usr/bin/env bash
# Home directory of spark distribution.
SPARK_HOME=/extends/soft/spark-2.4.4
JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
HADOOP_HOME=/extends/soft/hadoop-2.7.4
```
- Go into the config directory, copy one of the shipped example configs, and modify it.
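For example (both file names below are placeholders, not the actual names in the package; start from whichever sample config ships under config/):

```
cd config
cp example.conf gps.conf   # "example.conf" and "gps.conf" are hypothetical names
vim gps.conf
```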
- Create the config file that processes the data
- The processing I need here: read the log file, filter out the valid records, and store them.
- The full config file is below:
```
######
###### This config file is a demonstration of batch processing in waterdrop config
######

spark {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

input {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**
  # fake {
  #   result_table_name = "my_dataset"
  # }

  file {
    path = "file:///home/logs/gps.log"
    result_table_name = "gps"
    format = "text"
  }

  # You can also use other input plugins, such as hdfs
  # hdfs {
  #   result_table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

filter {
  # split data by specific delimiter
  # split {
  #   fields = ["msg", "name"]
  #   delimiter = " "
  #   result_table_name = "accesslog"
  # }

  # keep only the lines that contain the marker preceding the JSON payload
  sql {
    sql = "select * from gps where raw_message like '%接收到的数据为:%' "
  }

  # cut each line at the marker: field1 = log prefix, field2 = JSON payload
  split {
    source_field = "raw_message"
    delimiter = "接收到的数据为:"
    fields = ["field1", "field2"]
  }

  # parse the JSON payload into columns of a new table
  json {
    source_field = "field2"
    result_table_name = "gps_test"
  }

  # rename/cast the columns to match the ClickHouse table; keep only complete timestamps
  sql {
    sql = "select concat('',encrypt) as encrypt,`date` as up_date,concat('',lon) as lon,concat('',lat) as lat,concat('',vec1) as vec1,concat('',vec2) as vec2,concat('',vec3) as vec3,concat('',direction) as direction,concat('',altitude) as altitude,concat('',state) as state,concat('',alarm) as alarm,concat('',vehicleno) as vehicleno,concat('',vehiclecolor) as vehiclecolor,id,createBy as create_by,createDt as create_dt from gps_test where LENGTH(date) = 19"
  }

  # you can also use other filter plugins, such as sql
  # sql {
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

output {
  # choose stdout output plugin to output data to console
  # stdout {
  # }

  clickhouse {
    host = "hadoop-4:8123"
    clickhouse.socket_timeout = 50000
    database = "wlpt_01"
    table = "t_plt_vehicle_location_test"
    fields = ["id","encrypt","up_date","lon","create_by","create_dt","lat","vec1","vec2","vec3","direction","altitude","state","alarm","vehicleno","vehiclecolor"]
    username = "default"
    password = "********"
    bulk_size = 5
    retry = 3
  }

  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
```
- The approach
- Log data format:
```
2020-01-03 13:36:23,967 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,992 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=52559, crcCode=14014, msgId=4608, msgSn=61937, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[0, 0, 1]]
2020-01-03 13:36:23,993 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
2020-01-03 13:36:23,995 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b0000007306ce969412000000045701020f0000000000c9c2483132353538000000000000000000000000000212010000003d3530373334000000000000000000000000000000000000000000000000000000000000000000000000000000000000000030313339393134363431383920d35d)
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=115, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=8403, msgId=4608, msgSn=114202260, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=89, cap=89), versionFlag=[1, 2, 15]]
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
2020-01-03 13:36:23,996 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=91, cap=91) - (HEXDUMP: 5b0000005a0206ce969512000000045701020f0000000000c9c2483132353538000000000000000000000000000212020000002400030107e40d24100690c4100208318f00000000000168e300b500030000100300000000a78f5d)
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 30] head -> 91 ; tail -> 93
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 65] xxxxxxxxx: 4608 ; ########: 1111
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.Decoder [Decoder.java : 39] msg -> Message [msgLength=90, encryptFlag=0, msgGesscenterId=1111, encryptKey=0, crcCode=42895, msgId=4608, msgSn=114202261, msgBody=BigEndianHeapChannelBuffer(ridx=0, widx=64, cap=64), versionFlag=[1, 2, 15]]
2020-01-03 13:36:23,997 [INFO] [New I/O server worker #1-8] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 42] 接收定位数据
2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
2020-01-03 13:36:24,006 [INFO] [New I/O server worker #1-8] org.jboss.netty.handler.logging.LoggingHandler [JdkLogger.java : 58] [id: 0x1e24397b, /10.228.30.192:59894 => /10.228.30.215:8082] RECEIVED: BigEndianHeapChannelBuffer(ridx=0, widx=115, cap=115) - (HEXDUMP: 5b00000073016a52f61200000004570100000000000000c9c2444235323136000000000000000000000000000212010000003d353039383900000000000000000000000000000000000000000000000000000000000000000000000000000000000000003031343239383831343430300b845d)
```
- The data format I need here is exactly this JSON: {"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
- I need to capture this JSON, parse it, and put each field, one by one, into the corresponding column of the database table.
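For context, the target ClickHouse table must expose the columns listed in the output section's fields above. The DDL below is a sketch reconstructed from that list, not the post's actual schema; the String column types (suggested by the concat('', x) casts in the filter), the MergeTree engine, and the ordering key are all assumptions:

```
-- Hypothetical reconstruction of the target table; the real schema may differ.
CREATE TABLE wlpt_01.t_plt_vehicle_location_test
(
    id           String,
    encrypt      String,
    up_date      String,
    lon          String,
    create_by    String,
    create_dt    String,
    lat          String,
    vec1         String,
    vec2         String,
    vec3         String,
    direction    String,
    altitude     String,
    state        String,
    alarm        String,
    vehicleno    String,
    vehiclecolor String
)
ENGINE = MergeTree()
ORDER BY id;   -- assumed ordering key
```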
- The initial idea was regex filtering: extract the JSON that follows the marker 接收到的数据为: ("the received data is:"), then parse that JSON and load it into the database.
- But regex filtering with select * from gps where raw_message rlike '{.*}$' threw an error and went nowhere. I asked the author, Ricky Huo, who suggested using SQL's like instead.
- With the regex approach off the table, I tried like. Note that it behaves like like in MySQL, plain substring matching. Apply it to raw_message so that only the lines containing the marker 接收到的数据为: are kept:
```
sql {
    sql = "select * from gps where raw_message like '%接收到的数据为:%' "
}
```
- Tested successfully. The data the filter keeps looks like this: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] 接收到的数据为:{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"}
- Use split to divide the line, with 接收到的数据为: as the split point. Each matching line ends up divided into three pieces, like this, where | marks the boundaries: 2020-01-03 13:36:23,997 [INFO] [Thread-36511178] com.xn.logistics.gps.server.BusiHandler [BusiHandler.java : 114] | 接收到的数据为:|{"encrypt":0,"date":"2020-01-03 13:36:16","lon":"1.1015067","lat":"3.409140","vec1":0,"vec2":0,"vec3":92387,"direction":181,"altitude":3,"state":4099,"alarm":0,"vehicleNo":"陕H12558","vehicleColor":2,"id":"MSG陕H12558","createBy":"UP_EXG_MSG_REAL_LOCATION","createDt":"2020-01-03 13:36:23"} The delimiter itself is consumed, so field1 receives the log prefix and field2 receives the JSON.
```
split {
    source_field = "raw_message"
    delimiter = "接收到的数据为:"
    fields = ["field1", "field2"]
}
```
- With that, the JSON has been split out; the split step above already named the output fields (see the full config file earlier). The next step is to parse this JSON, using the json plugin:
```
json {
    source_field = "field2"
    result_table_name = "gps_test"
}
```
- Parsing succeeds here and produces results: the json plugin expands the keys of the JSON in field2 (encrypt, date, lon, and so on) into columns. But we need a new table to hold this processed result, which is why result_table_name above renames the result table to gps_test.
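One more step from the full config is worth spelling out: the final sql plugin reshapes the parsed columns to match the ClickHouse table (abbreviated here; the complete statement is in the config above):

```
sql {
    sql = "select concat('',encrypt) as encrypt, `date` as up_date, concat('',lon) as lon, ... from gps_test where LENGTH(date) = 19"
}
```

Two details: concat('', x) is a cheap way to cast the numeric JSON fields to strings so they match the table's column types, and LENGTH(date) = 19 keeps only rows whose date is a complete "yyyy-MM-dd HH:mm:ss" timestamp (19 characters), presumably to drop records split across physical log lines, as in the sample above where the JSON wraps mid-date.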
- Run it and check the stored data
- Finally, write the output section to perform the load, then check the ClickHouse table to confirm the write succeeded; the launch command is shown after the block below.
When first starting to debug, use the stdout output so each step's result is printed in the log, which makes debugging much more convenient:
```
output {
    # use the stdout output plugin to print data to the console while debugging
    stdout {
    }
}
```
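With the config finished, the job is submitted through Waterdrop's start script, passing the Spark master, deploy mode, and the config file. The flags below follow the Waterdrop 1.x quick-start docs, while the paths and config file name are the hypothetical ones from earlier:

```
cd /extends/soft/waterdrop-1.5.1            # hypothetical install path
./bin/start-waterdrop.sh --master local[2] --deploy-mode client --config ./config/gps.conf
# for a cluster run, e.g.: --master yarn --deploy-mode client
```

After the job finishes, a quick select count() from wlpt_01.t_plt_vehicle_location_test in clickhouse-client is an easy way to confirm that rows actually landed.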
Waterdrop documentation: https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
My thanks to the author of Waterdrop for his guidance and help throughout. Thank you, Ricky Huo.
Source: oschina
Link: https://my.oschina.net/u/2971292/blog/3157715