I have a pandas dataframe as:
word_list
['nuclear','election','usa','baseball']
['football','united','thriller']
['marvels','hollywood','s
First of all, I think you should take advantage of the O(1) lookup that sets and dictionaries provide. With that in mind, I'd set up the data like this (notice that the values are sets):
d = dict(movies={'spiderman', 'marvels', 'thriller'},
         sports={'baseball', 'hockey', 'football'},
         politics={'election', 'china', 'usa'})
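To make the point about set values concrete, here is a minimal sketch of a single-word lookup against that dictionary. The helper name `categories_of` is made up for illustration. Each `word in v` test is a hash lookup on a set, so its cost does not grow with the size of the category.

```python
d = dict(movies={'spiderman', 'marvels', 'thriller'},
         sports={'baseball', 'hockey', 'football'},
         politics={'election', 'china', 'usa'})

def categories_of(word):
    """Return the sorted names of all categories whose set contains `word`.

    `categories_of` is a hypothetical helper, not part of the original answer.
    """
    return sorted(k for k, v in d.items() if word in v)

print(categories_of('usa'))      # ['politics']
print(categories_of('nuclear'))  # []
```

An empty result is the case that the answer's function maps to 'Misc'.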
Then you can transform your series using your custom logic:
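def f(r):
    # Map one word to the categories that contain it, or ['Misc'] if none do.
    def m(r_):
        matches = [k for (k, v) in d.items() if r_ in v]
        return matches if matches else ['Misc']
    # Union the categories of every word in the row into a single set.
    return {item for r_ in r for item in m(r_)}

df.word_list.transform(f)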
0 {Misc, sports, politics}
1 {Misc, sports, movies}
2 {Misc, movies}
For 300000 rows,
%timeit df.word_list.transform(f)
1.1 s ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
which is not great, but doable.
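The thread does not say how the 300000-row frame was built. As a rough sketch, assuming it is simply the three sample rows repeated 100,000 times, a comparable measurement in plain Python could look like the following. The third row is taken from the printed output further down the thread. It times a list comprehension rather than `Series.transform`, so the number it prints is not directly comparable with the `%timeit` figure above.

```python
import time

d = dict(movies={'spiderman', 'marvels', 'thriller'},
         sports={'baseball', 'hockey', 'football'},
         politics={'election', 'china', 'usa'})

def f(r):
    # Same logic as the answer's function: categories per word, 'Misc' if none.
    def m(r_):
        matches = [k for k, v in d.items() if r_ in v]
        return matches if matches else ['Misc']
    return {item for r_ in r for item in m(r_)}

sample = [
    ['nuclear', 'election', 'usa', 'baseball'],
    ['football', 'united', 'thriller'],
    ['marvels', 'hollywood', 'spiderman', 'budget'],
]
rows = sample * 100_000  # assumption: 300,000 rows made by repeating the sample

start = time.perf_counter()
result = [f(r) for r in rows]
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows in {elapsed:.2f} s")
```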
You can first flatten the dictionary of lists into a word-to-category mapping, then look up each word with .get, using miscellaneous as the default for non-matched values. Convert the results to a set to keep only the unique categories, and turn them into a single string with join:
movies=['spiderman','marvels','thriller']
sports=['baseball','hockey','football']
politics=['election','china','usa']
d = {'movies':movies, 'sports':sports, 'politics':politics}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
f = lambda x: ','.join(set([d1.get(y, 'miscellaneous') for y in x]))
df['matched_list_names'] = df['word_list'].apply(f)
print(df)
                                 word_list             matched_list_names
0       [nuclear, election, usa, baseball]  politics,miscellaneous,sports
1             [football, united, thriller]    miscellaneous,sports,movies
2  [marvels, hollywood, spiderman, budget]           miscellaneous,movies
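To see what the flattening step produces, here is a small sketch in plain Python that builds the same inverted mapping and queries it. The loop variable names `word`, `category` and `words` are renamed here for readability and are not the ones used in the answer.

```python
movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']
d = {'movies': movies, 'sports': sports, 'politics': politics}

# Invert the mapping: one entry per word, pointing at its category name.
d1 = {word: category for category, words in d.items() for word in words}

print(d1['usa'])                           # politics
print(d1.get('nuclear', 'miscellaneous'))  # miscellaneous
print(len(d1))                             # 9
```

One consequence of this design: because each word is a dictionary key, a word that appears in more than one list keeps only the last category it was assigned. With the three lists above there is no overlap, so nothing is lost.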
Similar solution with list comprehension:
df['matched_list_names'] = [','.join(set([d1.get(y, 'miscellaneous') for y in x]))
                            for x in df['word_list']]
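Note that a set has no guaranteed iteration order, so the order of names in the joined string can differ between runs. If a stable order matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each label. The following is a sketch of that variant, not part of the answers above; the function name `match_names` and its parameters are invented for illustration.

```python
movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']
d = {'movies': movies, 'sports': sports, 'politics': politics}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

def match_names(words, lookup, default='miscellaneous'):
    """Join the unique category labels of `words` in first-seen order.

    `match_names` is a hypothetical helper; `lookup` maps word -> category.
    """
    labels = (lookup.get(w, default) for w in words)
    # dict.fromkeys drops duplicates while preserving insertion order.
    return ','.join(dict.fromkeys(labels))

print(match_names(['nuclear', 'election', 'usa', 'baseball'], d1))
# miscellaneous,politics,sports
print(match_names(['football', 'united', 'thriller'], d1))
# sports,miscellaneous,movies
```

On the DataFrame this could be applied as `df['word_list'].apply(lambda x: match_names(x, d1))`.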