Question
I am using Filebeat and am able to successfully push the logs to Elasticsearch into a particular index.
I have a use case where I need to find duplicates in the logs. Using an aggregation, I am able to find duplicates when the log lines match exactly, like below:
2019-07-23 11:38:17,401 WARN [org.amazon.events] (default task-3) type=LOGIN_ERROR, realmId=amazon, clientId=angular-cors, userId=209fd7db-6964-41ff-bffd-3975ccbc03bb, ipAddress=44.44.44.44, error=invalid_user_credentials, auth_method=openid-connect, grant_type=password, client_auth_method=client-secret, username=testuser@amazon.com
2019-07-23 11:38:17,401 WARN [org.amazon.events] (default task-3) type=LOGIN_ERROR, realmId=amazon, clientId=angular-cors, userId=209fd7db-6964-41ff-bffd-3975ccbc03bb, ipAddress=44.44.44.44, error=invalid_user_credentials, auth_method=openid-connect, grant_type=password, client_auth_method=client-secret, username=testuser@amazon.com
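For reference, the exact-match duplicate check above can be expressed as a `terms` aggregation on the raw message field. A minimal sketch of the query body, assuming the log line is indexed in a `message` field with a `message.keyword` sub-field (the field name and thresholds are assumptions, not from the post):

```python
# Sketch of a terms aggregation that buckets identical raw log lines.
# Field name "message.keyword" is an assumption about the mapping.
def build_duplicate_agg(min_count: int = 2, size: int = 100) -> dict:
    """Group identical messages; keep only buckets that occur at
    least `min_count` times, i.e. the duplicates."""
    return {
        "size": 0,  # we only care about the aggregation buckets
        "aggs": {
            "duplicate_messages": {
                "terms": {
                    "field": "message.keyword",
                    "min_doc_count": min_count,
                    "size": size,
                }
            }
        },
    }

body = build_duplicate_agg()
```

This body can be passed to a search request against the index; each returned bucket key is a log line that appears more than once.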
But say the time and task id change, as below; I still want to consider this a duplicate of the log above:
2019-07-23 11:38:18,401 WARN [org.amazon.events] (default task-4) type=LOGIN_ERROR, realmId=amazon, clientId=angular-cors, userId=209fd7db-6964-41ff-bffd-3975ccbc03bb, ipAddress=44.44.44.44, error=invalid_user_credentials, auth_method=openid-connect, grant_type=password, client_auth_method=client-secret, username=testuser@amazon.com
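One way to treat the two lines above as duplicates, before involving analyzers at all, is to normalize away the volatile parts (timestamp and task id) and fingerprint the rest. A minimal client-side sketch; the regexes match the log layout shown in the samples above and are assumptions:

```python
import hashlib
import re

# Strip the leading timestamp and the task id, which vary between
# otherwise identical log events (layout taken from the samples above).
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+")
TASK_RE = re.compile(r"\(default task-\d+\)\s+")

def fingerprint(line: str) -> str:
    """Hash the log line with its volatile fields removed, so that
    repeats of the same event produce the same fingerprint."""
    normalized = TIMESTAMP_RE.sub("", line)
    normalized = TASK_RE.sub("", normalized)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

Two lines that differ only in timestamp and task id then yield the same fingerprint, which can be compared or used as a deduplication key at index time.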
I have one way.
Solution:
i) If I use the standard analyzer with stopwords, I will be able to separate the line into tokens.
ii) Skip the stopwords and keep only the remaining tokens.
iii) Then use a multi-match / more-like-this query to check whether a similar log already exists.
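Step (iii) can be sketched as a `more_like_this` query body against the analyzed message field. The field name `message` and the similarity thresholds below are assumptions for illustration, not values from the post:

```python
# Sketch of step (iii): a more_like_this query asking Elasticsearch
# whether a similar log line already exists in the index.
# Field name "message" and the thresholds are assumptions.
def build_mlt_query(log_line: str) -> dict:
    return {
        "query": {
            "more_like_this": {
                "fields": ["message"],
                "like": log_line,
                "min_term_freq": 1,   # consider terms even if they occur once
                "min_doc_freq": 1,    # ...and appear in only one document
                "minimum_should_match": "90%",  # tune for near-duplicates
            }
        }
    }
```

If this query returns any hits above a chosen score threshold, the incoming line can be treated as a duplicate even though its timestamp and task id differ.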
This is working as of now. But is there a better way to get only the "keywords" from the logs using an analyzer, so that I don't end up with a large set of keywords?
Any help is appreciated.
Thanks,
Harry
Source: https://stackoverflow.com/questions/62942422/elasticsearch-analyzer-for-parsing-the-application-logs