User Agent Strings: What should I capture?

Submitted by a 夏天 on 2019-12-08 03:10:10

Question


I have a project where I'm parsing a Tomcat log for useful information. At first, the program was quite simple and could easily have been replaced by a single grep. However, once people realized that this information could be useful, I was asked to do more and more complex parsing.

It's gotten to the point where I want to store the generic information of each log entry in a database, and then run various queries to produce customized reports. Most of this information is fairly straightforward and easy to parse.

  • There's the IP address and the proxy addresses. For example, I now have a list of 10,000 IP addresses from a firm we hired to test our app security. Thus, we want to ignore entries from these IP addresses in any report.
  • There's the Session ID: I need a report where the user went to Page "A" before Page "B", but not after a visit to Page "C". The Session ID allows me to track this behavior.
  • There's the time and date.
  • There's the HTTP Response code.
  • There's whether this is a GET or POST action.
  • There's the webpage itself (and possible values passed via GET).
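
As a rough illustration of how the fields above could feed the two reports mentioned (ignoring the tester IPs, and the Page A / Page B / Page C ordering), here is a minimal sketch. The `LogEntry` record, its field names, and both helper functions are hypothetical and not tied to any actual Tomcat log format. The ordering rule is one possible reading of the requirement: a session qualifies if it reaches B after A without having visited C first.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical record: field names are illustrative, not a Tomcat format.
    ip: str
    session_id: str
    timestamp: datetime
    status: int
    method: str
    path: str


def drop_ignored(entries, ignored_ips):
    """Return the entries whose IP is not in the ignore list."""
    ignored = set(ignored_ips)  # set membership stays fast with 10,000 addresses
    return [e for e in entries if e.ip not in ignored]


def sessions_a_then_b_without_c(entries, a, b, c):
    """Return session IDs that reach `b` after `a` with no earlier visit to `c`.

    Assumed interpretation: once a session has visited `c`, later visits
    to `b` no longer count.
    """
    state = {}  # session_id -> (seen_a, seen_c)
    matched = set()
    for e in sorted(entries, key=lambda e: e.timestamp):
        seen_a, seen_c = state.get(e.session_id, (False, False))
        if e.path == c:
            seen_c = True
        elif e.path == a:
            seen_a = True
        elif e.path == b and seen_a and not seen_c:
            matched.add(e.session_id)
        state[e.session_id] = (seen_a, seen_c)
    return matched
```

In a real database the same logic would more likely be a query over the stored entries; the in-memory version only shows which fields the report depends on.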

And, finally there's the User Agent Mess... I mean User Agent String.

The User Agent String seems to have a rather loose layout. For example, 99% of them start with Mozilla/4.0, even though most of these come from browsers that have nothing to do with Mozilla, Netscape, or Firefox and don't even use the Gecko layout engine.

Unfortunately, the User Agent String is becoming rather important in our reports. For example, we need to know how many people are using Safari, how many are using any mobile browser, and how many are on a Linux-based system vs. Windows vs. iOS.
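
To make the difficulty concrete, here is a minimal heuristic sketch of the kind of classification those reports need. The function name `classify_user_agent`, the category labels, and the token list are assumptions chosen for illustration. Real User Agent Strings overlap heavily (Chrome's string also contains "Safari", Android's also contains "Linux"), so the order of the checks matters and this is far from exhaustive.

```python
def classify_user_agent(ua):
    """Rough substring-based guess at OS family, browser, and mobile flag."""
    s = ua.lower()

    # OS: test the more specific tokens first, because iOS strings also
    # contain "Mac OS X" and Android strings also contain "Linux".
    if "iphone" in s or "ipad" in s or "ipod" in s:
        os_family = "iOS"
    elif "android" in s:
        os_family = "Android"
    elif "windows" in s:
        os_family = "Windows"
    elif "mac os x" in s or "macintosh" in s:
        os_family = "macOS"
    elif "linux" in s:
        os_family = "Linux"
    else:
        os_family = "Other"

    # Browser: Chrome advertises "Safari" too, so Safari is checked last.
    if "chrome" in s or "chromium" in s or "crios" in s:
        browser = "Chrome"
    elif "firefox" in s:
        browser = "Firefox"
    elif "msie" in s or "trident" in s:
        browser = "Internet Explorer"
    elif "safari" in s:
        browser = "Safari"
    else:
        browser = "Other"

    mobile = "mobi" in s or os_family in ("iOS", "Android")
    return {"os": os_family, "browser": browser, "mobile": mobile}
```

Because the rules will keep changing, it probably makes sense to keep the raw string around and re-run a classifier like this whenever a new report is requested, rather than storing only its output.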

The big problem is I have no idea what might be requested in the future, so I am not 100% sure what information is useful and what isn't useful (Looks like 99.7% of our users are using Mozilla 4.0 browsers!).

So, how would you parse the User Agent String and pull out useful information that I could produce a report on?


Answer 1:


I'd start another table: UserAgentString, (id, UAS, num_uses, mobile, );

When parsing the log, store a UserAgentString.id for each log entry in your tables instead of the raw user agent string and add new strings to the UserAgentString table.

Each time you find a duplicate, increment the num_uses column. Set the mobile column if (when) you know it represents a mobile device. Add other columns and set them to represent other attributes of the UAS as you discover that they are useful.
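
The lookup-table approach described here could be sketched as follows. SQLite (through Python's standard `sqlite3` module) is an assumption made only to keep the example self-contained; the column types, the UNIQUE constraint, and the helper names `open_db` and `user_agent_id` are likewise illustrative rather than part of the answer.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the lookup table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS UserAgentString (
               id       INTEGER PRIMARY KEY,
               UAS      TEXT NOT NULL UNIQUE,
               num_uses INTEGER NOT NULL DEFAULT 0,
               mobile   INTEGER  -- NULL until the string has been classified
           )"""
    )
    return conn


def user_agent_id(conn, uas):
    """Return the id for `uas`, adding the string on first sight.

    num_uses is incremented on every call, including the first one.
    """
    conn.execute(
        "INSERT OR IGNORE INTO UserAgentString (UAS) VALUES (?)", (uas,)
    )
    conn.execute(
        "UPDATE UserAgentString SET num_uses = num_uses + 1 WHERE UAS = ?",
        (uas,),
    )
    row = conn.execute(
        "SELECT id FROM UserAgentString WHERE UAS = ?", (uas,)
    ).fetchone()
    return row[0]
```

Each log entry then stores only the returned id. Attributes that become interesting later can be added with ALTER TABLE and filled in once per distinct string instead of once per log line.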



Source: https://stackoverflow.com/questions/15140416/user-agent-strings-what-should-i-capture
