Pandas: combine columns without duplicates/ find unique words after combining

纵然是瞬间 提交于 2019-12-22 00:42:13

问题


I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

pd.read_csv("animal.csv")

  animal1         animal2        label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

I want to combine the columns but retain only unique information from each of the strings.

You can see that in row 2, 'cat' is contained in both columns 'Animal1' and 'Animal2'. In row 3, the number 26 is in both column 'Animal1' and 'Label'. Whereas in row 4, information that is in columns 'Animal2' and 'Label' are already contained in order in 'Animal1'.

I combine the columns by doing the following

animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat cat 72
3 pilchard 26     koala          26           pilchard 26 koala 26
4 newt bat 81     bat            81           newt bat 81 bat 81

Row 1 is fine, but the other rows, of course, contain duplicates as described above.

The output I would desire is:

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat 72
3 pilchard 26     koala          26           pilchard koala 26
4 newt bat 81     bat            81           newt bat 81

or if I could retain only the first unique instance of each word/ number per row in the detail column, this would also be suitable i.e.:

  detail 
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I've had a look at doing this for a string in python e.g. How can I remove duplicate words in a string with Python?, How to get all the unique words in the data frame?, show distinct column values in pyspark dataframe: python but can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after I've combined the columns, then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?

I have the solution in R but want to recode in python.

Would greatly appreciate any help or advice. I'm currently using Spyder(Python 3.5)


回答1:


You can add custom function where first split by whitespace, then get unique values by pandas.unique and last join to string back:

animals["detail"] = animals["animal1"].map(str) + ' ' + 
                    animals["animal2"].map(str) + ' ' +
                    animals["label"].map(str)

animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

Also is possible join values in apply:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())),axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

Solution with set, but it change order:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt


来源:https://stackoverflow.com/questions/44497671/pandas-combine-columns-without-duplicates-find-unique-words-after-combining

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!