Question
I have a dataframe of HTTP request logs. The only relevant column is userAgent, which I'm trying to parse with ua_parser. It turns each user-agent string into a nested dictionary like so:
>>> from ua_parser import user_agent_parser
>>> user_agent_parser.Parse('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36')
{
  'device': {'brand': None,
             'model': None,
             'family': 'Other'},
  'os': {'major': '10',
         'patch_minor': None,
         'minor': '10',
         'family': 'Mac OS X',
         'patch': '5'},
  'user_agent': {'major': '55',
                 'minor': '0',
                 'family': 'Chrome',
                 'patch': '2883'},
  'string': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
}
I'm trying to create 4 additional columns on my log dataframe using the results of user_agent_parser. I'd like device_brand, device_model, os_family, and user_agent_family columns.
Unfortunately, apply gives me a pandas Series of dictionaries rather than a single dictionary, so I can't index into the nested keys directly:
>>> parsed_ua = logs['userAgent'].apply(user_agent_parser.Parse)
>>> logs['device_brand'] = parsed_ua['device']['brand']
KeyError: 'device'
I tried converting this to a dataframe so I could merge parsed_ua with logs. Unfortunately, this writes each dictionary to a single column:
>>> pd.DataFrame(parsed_ua)
                                           userAgent
0  {u'device': {u'brand': None, u'model': None, u...
1  {u'device': {u'brand': None, u'model': None, u...
2  {u'device': {u'brand': None, u'model': None, u...
3  {u'device': {u'brand': None, u'model': None, u...
4  {u'device': {u'brand': None, u'model': None, u...
How can I parse the userAgent column and write the results to multiple columns?
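Answer 1:
You can use the json_normalize() method (in pandas 1.0 and later it is available as pd.json_normalize, and the pd.io.json.json_normalize path used below is deprecated):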
In [146]: pd.io.json.json_normalize(parsed_ua)
Out[146]:
  device.brand device.family device.model os.family os.major os.minor  \
0         None         Other         None  Mac OS X       10       10

  os.patch os.patch_minor                                    string  \
0        5           None  Mozilla/5.0 (Macintosh; Intel Mac OS...

  user_agent.family user_agent.major user_agent.minor user_agent.patch
0            Chrome               55                0             2883
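To get from the flattened frame to the four columns the question asks for, one option is to select and rename the dotted columns and join them back onto logs. This is a minimal sketch, assuming pandas 1.0 or later. The hard-coded sample dict stands in for the output of user_agent_parser.Parse so the snippet runs without ua_parser installed, and the names sample, flat, and wanted are illustrative only:

```python
import pandas as pd

# Stand-in for user_agent_parser.Parse output (copied from the question).
sample = {
    'device': {'brand': None, 'model': None, 'family': 'Other'},
    'os': {'major': '10', 'patch_minor': None, 'minor': '10',
           'family': 'Mac OS X', 'patch': '5'},
    'user_agent': {'major': '55', 'minor': '0', 'family': 'Chrome',
                   'patch': '2883'},
    'string': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/55.0.2883.95 Safari/537.36'),
}

logs = pd.DataFrame({'userAgent': [sample['string']] * 2})
# In real use: parsed_ua = logs['userAgent'].apply(user_agent_parser.Parse)
parsed_ua = pd.Series([sample, sample], index=logs.index)

# Flatten the nested dicts; nested keys become dotted column names.
flat = pd.json_normalize(parsed_ua.tolist())
flat.index = logs.index  # realign with logs before joining

# Map the dotted names to the column names wanted in the question.
wanted = {
    'device.brand': 'device_brand',
    'device.model': 'device_model',
    'os.family': 'os_family',
    'user_agent.family': 'user_agent_family',
}
logs = logs.join(flat[list(wanted)].rename(columns=wanted))
```

Passing parsed_ua.tolist() rather than the Series itself keeps the input a plain list of dicts, and resetting flat.index matters if logs does not use a default 0..n-1 index.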
Answer 2:
Building on what you've already done, you can pass a lambda to Series.apply to pull each nested value out of the parsed dictionaries:
# Parse every user-agent string once; the result is a Series of nested dicts.
ua = logs['userAgent'].apply(user_agent_parser.Parse)
# Extract one nested value per new column.
logs['device_brand'] = ua.apply(lambda x: x['device']['brand'])
logs['device_model'] = ua.apply(lambda x: x['device']['model'])
logs['os_family'] = ua.apply(lambda x: x['os']['family'])
logs['user_agent_family'] = ua.apply(lambda x: x['user_agent']['family'])
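If you only need those four fields, a variation is to collect them in a single pass instead of calling apply once per column. This is only a sketch: pick_fields is a hypothetical helper, not part of ua_parser or pandas, and the hard-coded sample dict stands in for real user_agent_parser.Parse output so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for user_agent_parser.Parse output (copied from the question).
sample = {
    'device': {'brand': None, 'model': None, 'family': 'Other'},
    'os': {'major': '10', 'patch_minor': None, 'minor': '10',
           'family': 'Mac OS X', 'patch': '5'},
    'user_agent': {'major': '55', 'minor': '0', 'family': 'Chrome',
                   'patch': '2883'},
    'string': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
               'AppleWebKit/537.36 (KHTML, like Gecko) '
               'Chrome/55.0.2883.95 Safari/537.36'),
}


def pick_fields(parsed):
    """Reduce one parsed user agent to the four wanted fields (hypothetical helper)."""
    return {
        'device_brand': parsed['device']['brand'],
        'device_model': parsed['device']['model'],
        'os_family': parsed['os']['family'],
        'user_agent_family': parsed['user_agent']['family'],
    }


logs = pd.DataFrame({'userAgent': [sample['string']] * 3})
# In real use: parsed_ua = logs['userAgent'].apply(user_agent_parser.Parse)
parsed_ua = pd.Series([sample] * 3, index=logs.index)

# Build all four columns at once, aligned on the index of logs.
extra = pd.DataFrame([pick_fields(p) for p in parsed_ua], index=logs.index)
logs = logs.join(extra)
```

Whether this is faster than four separate apply calls depends on your data, so it is worth timing on a realistic slice of the logs before switching.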
Source: https://stackoverflow.com/questions/41840862/pandas-parse-user-agent-column-into-multiple-columns