Drop rows based on specific conditions on strings

拈花ヽ惹草 posted on 2021-02-08 11:37:56

Question


Given this dataframe (which is a subset of mine):

username user_message
Polop I love this picture, which is very beautiful
Artil Meh
Artingo Es un cuadro preciosa, me recuerda a mi infancia.
Zona I like it
Soi Yuck, to say I hate it would be a euphemism
Iyu NaN

What I'm trying to do is drop rows whose message has fewer than 5 words (tokens), as well as rows that are not written in English. I'm not familiar with pandas, so I came up with this not-so-pretty solution:

import pandas as pd
from langdetect import detect
index = 0
index_list = []
for review in df["user_message"]:
    count = 0
    if str(review) == "NaN":
        index_list.append(index)
        continue
    for i in review:
        if(i.isspace()):
            count=count+1
    if len(review) == 0:
        index_list.append(index)
    elif review.isspace() is True:
        index_list.append(index)
    elif count < 5:
        index_list.append(index)
    else:
        try:
            detect(review)
            if detect(review) != "en":
                index_list.append(index)
            else:
                pass
        except:
            pass
    index = index + 1
df = df.drop(index_list, axis = 0).reset_index(drop = True)

This solution apparently does not work (blank rows and rows with only one word remain in my dataframe), and I'm sure there is a faster, more efficient method. Do you have any idea how to tackle this issue?

Thank you.
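A note on one likely cause of the leftover rows in the loop above: the `continue` in the NaN branch skips `index = index + 1`, so every index recorded after the first missing message points one row too early. The sketch below reproduces this in plain Python. It assumes, as the check in the loop implies, that missing messages are stored as the literal string "NaN"; the function names and the sample list are hypothetical and only serve the illustration.

```python
# Hypothetical sample: row 1 is a missing message stored as the string "NaN".
messages = [
    "I love this picture, which is very beautiful",
    "NaN",
    "Meh",
    "I like it",
]

def short_rows_buggy(rows, min_tokens=5):
    """Mimics the manual counter: `continue` skips the increment."""
    index = 0
    flagged = []
    for text in rows:
        if text == "NaN":
            flagged.append(index)
            continue  # index is NOT advanced for this row
        if len(text.split()) < min_tokens:
            flagged.append(index)
        index += 1
    return flagged

def short_rows_fixed(rows, min_tokens=5):
    """Same rule, but enumerate() keeps the position in sync with the row."""
    return [
        i for i, text in enumerate(rows)
        if text == "NaN" or len(text.split()) < min_tokens
    ]

print(short_rows_buggy(messages))  # [1, 1, 2]: row 3 ("I like it") is never flagged
print(short_rows_fixed(messages))  # [1, 2, 3]
```

Counting words with `text.split()` rather than counting whitespace characters also avoids being off by one, since a message with four spaces already has five words.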

EDIT: So I finally got it to work, thanks to the answer of @ansev. Since TextBlob raises an error if too many requests are sent, I relied on the langdetect module. Here is the corresponding code:

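# Keep messages with more than 5 space-separated tokens
m1 = df['user_message'].str.split(' ').str.len() > 5
# True only for cells made up entirely of whitespace (NaN becomes False)
m2 = df['user_message'].str.isspace().eq(True)
df_filtered = df.loc[m1 & ~m2].reset_index(drop=True)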
m3 = df_filtered['user_message'].astype(str).apply(lambda x: detect(x) if len(x) >= 5 else '').eq('en')
df_filtered = df_filtered.loc[m3].reset_index(drop=True)

I had to compute m3 separately, since detect raises an error if it cannot identify the text. This is often caused by strings that contain only whitespace, which is why I added the m2 condition: it checks whether a cell contains only whitespace (returning True if so).
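The guard described above can also be folded into a small wrapper, so that missing, blank, or undetectable messages simply yield no language instead of raising. This is a minimal sketch, not the method used in the post: `safe_detect`, its `min_chars` parameter, and `toy_detector` are hypothetical names, and the toy detector is only a stand-in so the example runs without langdetect installed.

```python
from typing import Callable, Optional

def safe_detect(text: object,
                detector: Callable[[str], str],
                min_chars: int = 5) -> Optional[str]:
    """Return the detected language code, or None if detection is skipped or fails."""
    # NaN (a float), empty, whitespace-only, or very short input is not sent to the detector.
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return None
    try:
        return detector(text)
    except Exception:
        # Broad on purpose: the sketch makes no assumption about the detector's exception type.
        return None

def toy_detector(text: str) -> str:
    """Stand-in for a real detector (NOT langdetect): spots a few English stop words."""
    words = set(text.lower().split())
    if not words:
        raise ValueError("no features in text")
    return "en" if words & {"i", "the", "it", "is"} else "unknown"

print(safe_detect("I like it very much", toy_detector))   # en
print(safe_detect("       ", toy_detector))               # None
print(safe_detect(float("nan"), toy_detector))            # None
```

With langdetect available, its `detect` function could presumably be passed as `detector`, for instance `df['user_message'].apply(safe_detect, detector=detect).eq('en')`; rows that return None then compare as False and are filtered out.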


Answer 1:


Use:

from textblob import TextBlob
m1 = df['user_message'].astype(str).apply(lambda x: TextBlob(x).detect_language() 
                                          if len(x) >= 3 else '').eq('en') 
m2 = df['user_message'].str.split(' ').str.len() > 5
df_filtered = df.loc[m1 | m2]
print(df_filtered)

  username                                       user_message
0    Polop       I love this picture, which is very beautiful
2  Artingo  Es un cuadro preciosa, me recuerda a mi infancia.
3     Zona                                          I like it
4      Soi        Yuck, to say I hate it would be a euphemism

If you get the error No module named textblob, install the package first (for example with pip install textblob).
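The token-count mask can also be written so that it handles blank and missing messages on its own, with no network call involved. The sketch below is an assumption-laden variant, not the code from this answer: it uses `str.split()` with no argument, which splits on any run of whitespace, and keeps rows with at least 5 tokens, following the threshold stated in the question. The "Blank" row is hypothetical and added only to show the whitespace case.

```python
import pandas as pd

df = pd.DataFrame({
    "username": ["Polop", "Artil", "Artingo", "Zona", "Soi", "Iyu", "Blank"],
    "user_message": [
        "I love this picture, which is very beautiful",
        "Meh",
        "Es un cuadro preciosa, me recuerda a mi infancia.",
        "I like it",
        "Yuck, to say I hate it would be a euphemism",
        None,       # missing message
        "       ",  # hypothetical whitespace-only row, added for illustration
    ],
})

# split() without an argument yields an empty list for a whitespace-only string,
# so its token count is 0; a missing message yields NaN.
token_counts = df["user_message"].str.split().str.len()

# NaN >= 5 evaluates to False, so missing messages are dropped together with short ones.
long_enough = token_counts >= 5

df_long = df.loc[long_enough].reset_index(drop=True)
print(df_long["username"].tolist())  # ['Polop', 'Artingo', 'Soi']
```

A language mask can then be computed on `df_long` only, which keeps the number of detection calls down.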



Source: https://stackoverflow.com/questions/65864957/drop-rows-based-on-specific-conditions-on-strings
