Construct pandas DataFrame from items in nested dictionary

Anonymous (unverified), posted on 2019-12-03 02:11:02

Question:

Suppose I have a nested dictionary 'user_dict' with structure:

Level 1: UserId (Long Integer)

Level 2: Category (String)

Level 3: Assorted Attributes (floats, ints, etc..)

For example, an entry of this dictionary would be:

    user_dict[12] = {
        "Category 1": {"att_1": 1,
                       "att_2": "whatever"},
        "Category 2": {"att_1": 23,
                       "att_2": "another"}}

Each item in "user_dict" has the same structure, and "user_dict" contains a large number of items which I want to feed to a pandas DataFrame, constructing the series from the attributes. In this case a hierarchical index would be useful.

Specifically, my question is whether there is a way to help the DataFrame constructor understand that the series should be built from the values of "level 3" in the dictionary.

If I try something like:

    df = pandas.DataFrame(user_dict)

The items in "level 1" (the user IDs) are taken as columns, which is the opposite of what I want to achieve (user IDs as the index).

I know I could construct the series after iterating over the dictionary entries, but a more direct way would be very useful. A similar question is whether it is possible to construct a pandas DataFrame from JSON objects listed in a file.

Answer 1:

A pandas MultiIndex consists of a list of tuples. So the most natural approach would be to reshape your input dict so that its keys are tuples corresponding to the multi-index values you require. Then you can just construct your dataframe using pd.DataFrame.from_dict, using the option orient='index':

    import pandas as pd

    user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                      'Category 2': {'att_1': 23, 'att_2': 'another'}},
                 15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                      'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

    # Keys are (user id, category) tuples, so the index becomes a MultiIndex.
    pd.DataFrame.from_dict({(i, j): user_dict[i][j]
                            for i in user_dict.keys()
                            for j in user_dict[i].keys()},
                           orient='index')

                   att_1     att_2
    12 Category 1      1  whatever
       Category 2     23   another
    15 Category 1     10       foo
       Category 2     30       bar
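Once the frame is keyed by tuples, the hierarchical index can be queried per user or per category. The following is only a minimal sketch: the level names user_id and category are labels chosen here for illustration and do not appear in the original data.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# Same reshaping as above: (user id, category) tuples become the row index.
df = pd.DataFrame.from_dict(
    {(user_id, category): attrs
     for user_id, categories in user_dict.items()
     for category, attrs in categories.items()},
    orient='index',
)

# Assumed level names, used only to make the selections below readable.
df.index.names = ['user_id', 'category']

# All categories for one user (rows indexed by category).
per_user = df.loc[12]

# One category across all users (rows indexed by user id).
per_category = df.xs('Category 1', level='category')
```

Naming the levels is optional; df.xs('Category 1', level=1) selects the same rows by position.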

An alternative approach would be to build your dataframe up by concatenating the component dataframes:

    user_ids = []
    frames = []

    # On Python 2, use user_dict.iteritems() instead of items().
    for user_id, d in user_dict.items():
        user_ids.append(user_id)
        frames.append(pd.DataFrame.from_dict(d, orient='index'))

    # keys= adds the user ids as the outer index level.
    pd.concat(frames, keys=user_ids)

                   att_1     att_2
    12 Category 1      1  whatever
       Category 2     23   another
    15 Category 1     10       foo
       Category 2     30       bar
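The concatenation approach can probably be written more compactly, because pd.concat also accepts a mapping and uses its keys as the outer index level. This is a sketch of that variant rather than a replacement for the loop; the level names passed to rename_axis are assumptions made for illustration.

```python
import pandas as pd

user_dict = {12: {'Category 1': {'att_1': 1, 'att_2': 'whatever'},
                  'Category 2': {'att_1': 23, 'att_2': 'another'}},
             15: {'Category 1': {'att_1': 10, 'att_2': 'foo'},
                  'Category 2': {'att_1': 30, 'att_2': 'bar'}}}

# One small frame per user; the dict keys (user ids) become the outer level.
df = pd.concat(
    {user_id: pd.DataFrame.from_dict(categories, orient='index')
     for user_id, categories in user_dict.items()}
)

# Assumed, illustrative names for the two index levels.
df = df.rename_axis(['user_id', 'category'])
```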


Answer 2:

I used to iterate through the dictionary with a for loop as well, but I've found it much faster to convert to a Panel and then to a DataFrame. Say you have a dictionary d:

    import datetime
    import pandas as pd

    d = {'RAY Index': {datetime.date(2014, 11, 3): {'PX_LAST': 1199.46, 'PX_OPEN': 1200.14},
                       datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
                       datetime.date(2014, 11, 5): {'PX_LAST': 1200.936, 'PX_OPEN': 1195.32},
                       datetime.date(2014, 11, 6): {'PX_LAST': 1206.061, 'PX_OPEN': 1200.62}},
         'SPX Index': {datetime.date(2014, 11, 3): {'PX_LAST': 2017.81, 'PX_OPEN': 2018.21},
                       datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
                       datetime.date(2014, 11, 5): {'PX_LAST': 2023.57, 'PX_OPEN': 2015.29},
                       datetime.date(2014, 11, 6): {'PX_LAST': 2031.21, 'PX_OPEN': 2023.33}}}

Building a Panel from it gives:

    pd.Panel(d)

    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 2 (major_axis) x 4 (minor_axis)
    Items axis: RAY Index to SPX Index
    Major_axis axis: PX_LAST to PX_OPEN
    Minor_axis axis: 2014-11-03 to 2014-11-06

Indexing the panel with pd.Panel(d)[item] yields a DataFrame:

    pd.Panel(d)['SPX Index']

             2014-11-03  2014-11-04  2014-11-05  2014-11-06
    PX_LAST     2017.81     2012.10     2023.57     2031.21
    PX_OPEN     2018.21     2015.81     2015.29     2023.33

You can then call to_frame() to turn the panel into a DataFrame. I also use reset_index to turn the major and minor axes into columns rather than keeping them as indices.

    pd.Panel(d).to_frame().reset_index()

    major    minor       RAY Index  SPX Index
    PX_LAST  2014-11-03   1199.460    2017.81
    PX_LAST  2014-11-04   1195.323    2012.10
    PX_LAST  2014-11-05   1200.936    2023.57
    PX_LAST  2014-11-06   1206.061    2031.21
    PX_OPEN  2014-11-03   1200.140    2018.21
    PX_OPEN  2014-11-04   1197.690    2015.81
    PX_OPEN  2014-11-05   1195.320    2015.29
    PX_OPEN  2014-11-06   1200.620    2023.33
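Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so the Panel calls in this answer only run on older versions. A similar long table can probably be built without Panel by flattening the nested dictionary into a Series keyed by tuples and unstacking one level. This is a rough sketch under that assumption, not the answer's original method; the data is trimmed to two dates from the dictionary d, and the names values and table are introduced here for illustration.

```python
import datetime
import pandas as pd

# Two dates from the answer's dictionary, trimmed to keep the sketch short.
d = {
    'RAY Index': {
        datetime.date(2014, 11, 3): {'PX_LAST': 1199.46, 'PX_OPEN': 1200.14},
        datetime.date(2014, 11, 4): {'PX_LAST': 1195.323, 'PX_OPEN': 1197.69},
    },
    'SPX Index': {
        datetime.date(2014, 11, 3): {'PX_LAST': 2017.81, 'PX_OPEN': 2018.21},
        datetime.date(2014, 11, 4): {'PX_LAST': 2012.1, 'PX_OPEN': 2015.81},
    },
}

# One value per (field, date, item) key; tuple keys become a MultiIndex.
values = pd.Series({
    (field, date, item): value
    for item, by_date in d.items()
    for date, fields in by_date.items()
    for field, value in fields.items()
})

# Move the item level into columns, name the remaining levels the way
# Panel.to_frame() did ('major', 'minor'), and turn them into columns.
table = (values.unstack(level=2)
               .rename_axis(['major', 'minor'])
               .reset_index())
```

Unstacking a different level (for example the date level) gives the wide layout that the transpose call produces in the Panel version.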

Finally, if you don't like the way the frame looks, you can use the panel's transpose function to change the layout before calling to_frame(). See the documentation here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Panel.transpose.html

For example:

    pd.Panel(d).transpose(2,0,1).to_frame().reset_index()

    major      minor    2014-11-03  2014-11-04  2014-11-05  2014-11-06
    RAY Index  PX_LAST     1199.46    1195.323    1200.936    1206.061
    RAY Index  PX_OPEN     1200.14    1197.690    1195.320    1200.620
    SPX Index  PX_LAST     2017.81    2012.100    2023.570    2031.210
    SPX Index  PX_OPEN     2018.21    2015.810    2015.290    2023.330

Hope this helps.


