Question
I have two relational dataframes like the ones below.
df_doc:
|document_id| name|
+-----------+-----+
|          1|  aaa|
|          2|  bbb|
df_topic:
|   topic_id| name|document_id|
+-----------+-----+-----------+
|          1|  xxx|          1|
|          2|  yyy|          2|
|          3|  zzz|          2|
I want to merge them into a single nested JSON file like the one below.
[
  {
    "document_id": 1,
    "name": "aaa",
    "topics": [
      {
        "topic_id": 1,
        "name": "xxx"
      }
    ]
  },
  {
    "document_id": 2,
    "name": "bbb",
    "topics": [
      {
        "topic_id": 2,
        "name": "yyy"
      },
      {
        "topic_id": 3,
        "name": "zzz"
      }
    ]
  }
]
That is, I want to do the reverse of what pandas.io.json.json_normalize does.
An answer using sqlite is also OK.
NOTE: Both df_doc and df_topic have a column called "name". The column names are the same, but the values are different.
Thanks.
Answer 1:
If df_doc has only 2 columns (document_id plus one more, called title in this snippet), first use map to add the title column to df_topic, then groupby, convert each group with to_dict, and finally call to_json:
s = df_doc.set_index('document_id')['title']
df_topic['title'] = df_topic['document_id'].map(s)
# select all columns except the ones in the list
cols = df_topic.columns.difference(['document_id','title'])
# 'records' is the full orient name; the short form 'r' is rejected by recent pandas
j = (df_topic.groupby(['document_id','title'])[cols]
             .apply(lambda x: x.to_dict('records'))
             .reset_index(name='topics')
             .to_json(orient='records'))
print(j)
[{"document_id":1,"title":"aaa","topics":[{"name":"xxx","topic_id":1}]},
{"document_id":2,"title":"bbb","topics":[{"name":"yyy","topic_id":2},
{"name":"zzz","topic_id":3}]}]
If df_doc has multiple columns, use merge instead of map:
df = df_topic.merge(df_doc, on='document_id')
print(df)
   topic_id name  document_id title
0         1  xxx            1   aaa
1         2  yyy            2   bbb
2         3  zzz            2   bbb
cols = df.columns.difference(['document_id','title'])
# 'records' is the full orient name; the short form 'r' is rejected by recent pandas
j = (df.groupby(['document_id','title'])[cols]
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='topics')
       .to_json(orient='records'))
EDIT: If both frames can contain the same column names, add the suffixes parameter so that an underscore (_) is appended to the duplicated names from df_doc to keep them unique, then strip it at the end:
df = df_topic.merge(df_doc, on='document_id', suffixes=('','_'))
print(df)
   topic_id name  document_id name_
0         1  xxx            1   aaa
1         2  yyy            2   bbb
2         3  zzz            2   bbb
# exclude the grouping columns; the suffixed column is 'name_' here, not 'title'
cols = df.columns.difference(['document_id','name_'])
j = (df.groupby(['document_id','name_'])[cols]
       .apply(lambda x: x.to_dict('records'))
       .reset_index(name='topics')
       .rename(columns=lambda x: x.rstrip('_'))
       .to_json(orient='records'))
print(j)
[{"document_id":1,"name":"aaa","topics":[{"name":"xxx","topic_id":1}]},
{"document_id":2,"name":"bbb","topics":[{"name":"yyy","topic_id":2},
{"name":"zzz","topic_id":3}]}]
Source: https://stackoverflow.com/questions/49953820/merging-two-relational-pandas-dataframes-as-single-nested-json-output