How to filter on pandas dataframe when column data type is a list

Submitted by ぐ巨炮叔叔 on 2020-12-26 07:46:52

Question


I am having some trouble filtering a pandas DataFrame on a column (let's call it column_1) whose values are lists. Specifically, I want to return only the rows where the intersection of column_1 and another predetermined list is not empty. However, when I try to put the logic inside the arguments of the .where function, I always get errors. Below are my attempts, with the errors returned.

  • Attempting to test whether a single element is inside the list:

    table[element in table['column_1']] returns the error ... KeyError: False

  • Trying to compare a list to all of the lists in the rows of the dataframe:

    table[[349569] == table.column_1] returns the error Arrays were different lengths: 23041 vs 1

I'm trying to get these two intermediate steps down before I test the intersection of the two lists.

Thanks for taking the time to read over my problem!


Answer 1:


Consider the pd.Series s

import pandas as pd

s = pd.Series([[1, 2, 3], list('abcd'), [9, 8, 3], ['a', 4]])
print(s)

0       [1, 2, 3]
1    [a, b, c, d]
2       [9, 8, 3]
3          [a, 4]
dtype: object

And a testing list test

test = ['b', 3, 4]

Apply a lambda function that converts each element of s to a set and takes its intersection with test

print(s.apply(lambda x: list(set(x).intersection(test))))

0    [3]
1    [b]
2    [3]
3    [4]
dtype: object

To use it as a mask, use bool instead of list

s.apply(lambda x: bool(set(x).intersection(test)))

0    True
1    True
2    True
3    True
dtype: bool
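
The same idea can be applied to the DataFrame from the question. What follows is only a minimal sketch: the table contents, the id column, and the wanted list are made up for illustration. It also covers the two intermediate steps the question asks about. The first attempt fails because `element in table['column_1']` tests the Series index labels rather than its values, so it evaluates to a single Python bool, and `table[False]` is then treated as a lookup for a column named False. The second fails because pandas tries an element-wise comparison between the one-item list and the whole column. Doing the check row by row with apply avoids both problems.

```python
import pandas as pd

# Hypothetical table; column_1 holds Python lists (object dtype).
table = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "column_1": [[349569], [1, 2, 3], [349569, 7], []],
})
wanted = [349569, 3]  # assumed stand-in for the "predetermined list"

# Step 1: is a single element inside each row's list?
has_element = table["column_1"].apply(lambda row: 349569 in row)

# Step 2: does each row's list equal a given list exactly?
equals_list = table["column_1"].apply(lambda row: row == [349569])

# Step 3: is the intersection with `wanted` non-empty?
wanted_set = set(wanted)
overlaps = table["column_1"].apply(lambda row: not wanted_set.isdisjoint(row))

# Boolean Series can be used directly as a row mask.
filtered = table[overlaps]
print(filtered)
```

Here `set.isdisjoint` is used instead of building the intersection, since only the yes/no answer is needed for the mask; `bool(set(row).intersection(wanted))` would give the same result.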



Answer 2:


For long-term use, you can wrap the whole workflow in functions and apply them wherever you need. Since you did not provide an example dataset, I will use one of my own. Suppose I have a text database. First I extract the hashtags from each text into a list, then I search for only the hashtags I want and filter the data.

import re

import pandas as pd


# find all the hashtags in the message
def find_hashtags(post_msg):
    combo = r'#\w+'
    rx = re.compile(combo)
    hash_tags = rx.findall(post_msg)
    return hash_tags


# check the found hashtags against a list of wanted tags and return True or False
def match_tags(tag_list, htag_list):
    matched_items = bool(set(tag_list).intersection(htag_list))
    return matched_items


test_data = [{'text': 'Head nipid mõnusateks sõitudeks kitsastel tänavatel. #TipStop'},
 {'text': 'Homses Rooli Võimus uus #Peugeot208!\nVaata kindlasti.'},
 {'text': 'Soovitame ennast tulevikuks ette valmistada, electric car sest uus #PeugeotE208 on peagi kohal!  ⚡️⚡️\n#UnboringTheFuture'},
 {'text': "Aeg on täiesti uueks roadtrip'i kogemuseks! \nLase ennast üllatada - #Peugeot5008!"},
 {'text': 'Tõeline ikoon, mille stiil avaldab muljet läbi eco car, electric cars generatsioonide #Peugeot504!'}
]

test_df = pd.DataFrame(test_data)

# find all the hashtags
test_df["hashtags"] = test_df["text"].apply(find_hashtags)

# the only hashtags we are interested in
tag_search = ["#TipStop", "#Peugeot208"]

# match the tags in our list
test_df["tag_exist"] = test_df["hashtags"].apply(lambda x: match_tags(x, tag_search))

# filter the data
main_df = test_df[test_df.tag_exist]
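
As a possible shorter variant of the workflow above, the hashtag extraction can be done with the vectorized `Series.str.findall` method instead of a helper function. This is only a sketch: the `posts` frame and its texts are invented sample data, and `tag_search` is written as a set here so that `isdisjoint` can be called on it.

```python
import pandas as pd

# Hypothetical sample posts, invented for this sketch.
posts = pd.DataFrame({"text": [
    "New model out now #Peugeot208",
    "Nothing tagged here",
    "Parking tips #TipStop #UnboringTheFuture",
    "Classic icon #Peugeot504",
]})

# The only hashtags we are interested in, as a set.
tag_search = {"#TipStop", "#Peugeot208"}

# Extract every hashtag per row; each row becomes a list of strings.
hashtags = posts["text"].str.findall(r"#\w+")

# True where at least one extracted hashtag is in tag_search.
mask = hashtags.apply(lambda tags: not tag_search.isdisjoint(tags))

matched = posts[mask]
print(matched)
```

Keeping the named helper functions, as in the answer above, may still be preferable when the same matching logic is reused in several places.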


Source: https://stackoverflow.com/questions/39729292/how-to-filter-on-pandas-dataframe-when-column-data-type-is-a-list
