pandas.io.json.json_normalize with very nested json

Submitted by 懵懂的女人 on 2019-11-27 14:00:02

In the pandas example (below), what do the brackets mean? Is there a logic to follow to go deeper with the []?

Each element in ['state', 'shortname', ['info', 'governor']] is a path to a value to include alongside the selected rows. The 'counties' argument sets which rows are produced, and the list that follows it adds metadata that is included with each of those rows.

Each element is a path; a nested list denotes a path into a nested structure. In the example output you can see the corresponding values in the state, shortname and info.governor columns.
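To make the two roles concrete, here is a minimal sketch using a small state/county structure in the spirit of the pandas documentation example. The sample values are illustrative, and the import fallback is only there because json_normalize moved to the top-level pandas namespace in pandas 1.0:

```python
try:
    from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize  # older releases

# Illustrative data: each state has flat keys, a nested dict and a list of counties.
data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit", "population": 1234}],
    },
]

# 'counties' selects the rows; the list names the metadata copied onto each row.
# ['info', 'governor'] is a nested path and ends up as the column 'info.governor'.
result = json_normalize(data, "counties", ["state", "shortname", ["info", "governor"]])
print(result)
```

Each county becomes one row, and the state-level values are repeated on every county row that belongs to that state.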

In your example JSON, there are few nested lists to elevate with the first argument, the way 'counties' was in the example. The only candidate in that data structure is the nested 'authors' key; you'd have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows:

>>> json_normalize(raw, [['_source', 'authors']], ['_id', ['_source', 'journal'], ['_source', 'title']])
                      affiliations author_id          author_name       _id  \
0                              NaN  166468F4  a bowdoin van riper  7FDFEB02
1                              NaN  81070854   jeffrey h schwartz  7FDFEB02
2  [Pennsylvania State University]  7E15BDFA       roger l geiger  7538108B

                  _source.journal  \
0  The American Historical Review
1  The American Historical Review
2  The American Historical Review

                                       _source.title
0  Men Among the Mammoths: Victorian Science and ...
1  Men Among the Mammoths: Victorian Science and ...
2  Elizabeth Popp Berman. Creating the Market Uni...

So this is a dataframe of authors, with added metadata for each author (_id value, journal name and article title).

Note the path for the first argument; if you want to list a nested path you need to provide a list of paths (even if it is just one path); just ['_source', 'authors'] would look for two row sources, each a simple top-level name.

The second argument then pulls in the _id key from the outermost object, but the title and journal name are list paths, as these are nested too.
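The call above can be tried on a small stand-in for raw. The structure below is an assumption reconstructed from the output shown (a list of hits, each with an _id and a _source holding journal, title and authors); the titles are placeholders, since the real ones are truncated in that output:

```python
try:
    from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize  # older releases

# Hypothetical sample shaped like the question's data; titles are placeholders.
raw = [
    {
        "_id": "7FDFEB02",
        "_source": {
            "journal": "The American Historical Review",
            "title": "Placeholder title A",
            "authors": [
                {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
                {"author_id": "81070854", "author_name": "jeffrey h schwartz"},
            ],
        },
    },
    {
        "_id": "7538108B",
        "_source": {
            "journal": "The American Historical Review",
            "title": "Placeholder title B",
            "authors": [
                {
                    "author_id": "7E15BDFA",
                    "author_name": "roger l geiger",
                    "affiliations": ["Pennsylvania State University"],
                },
            ],
        },
    },
]

authors_df = json_normalize(
    raw,
    [["_source", "authors"]],  # one row source, given as a nested path
    ["_id", ["_source", "journal"], ["_source", "title"]],  # metadata paths
)
print(authors_df)
```

The nested metadata paths are joined with a dot to form the column names, which is where _source.journal and _source.title come from.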

You can also have a look at the library flatten_json, which does not require you to write column hierarchies as in json_normalize:

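import pandas as pd
from flatten_json import flatten

# d is assumed to be the parsed JSON response; each hit is one record
data = d['hits']['hits']
# flatten each record into a single-level dict, joining nested keys with '.'
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df)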

See https://github.com/amirziai/flatten.
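To show what kind of column names this approach produces, here is a simplified stand-in for the flattening step, written in plain Python. flatten_sketch is a hypothetical helper for illustration only, not the library's implementation, and it may differ from flatten_json in edge cases such as empty lists or dicts:

```python
def flatten_sketch(obj, sep=".", prefix=""):
    """Flatten nested dicts and lists into one dict (illustrative stand-in only)."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become part of the key
    else:
        return {prefix: obj}  # leaf value: store it under the accumulated path

    flat = {}
    for key, value in items:
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_sketch(value, sep, path))
    return flat


# Hypothetical record shaped like one hit from the question's data.
record = {
    "_id": "7538108B",
    "_source": {
        "journal": "The American Historical Review",
        "authors": [{"author_id": "7E15BDFA", "author_name": "roger l geiger"}],
    },
}

flat = flatten_sketch(record)
print(flat)
```

The design difference from json_normalize is that each record stays a single row: list items are spread across numbered columns such as _source.authors.0.author_id instead of becoming rows of their own.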
