summarize from string matches

最后都变了 - submitted on 2019-12-11 18:42:33

Question


I have this df column:

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld", "asdthreeasp", "asdfetwoasd", "fouroqwke","okasdtwo", "acmofour", "porefour", "okstwo"))
> df
          Strings
1  ñlas onepojasd
2       onenañdsl
3     ñelrtwofkld
4     asdthreeasp
5     asdfetwoasd
6       fouroqwke
7        okasdtwo
8        acmofour
9        porefour
10         okstwo

I know that each value in df$Strings contains one of the words one, two, three or four, and that it contains exactly ONE of those words. So, to match them:

library(stringr)  # provides str_detect

str_detect(df$Strings, "one")
str_detect(df$Strings, "two")
str_detect(df$Strings, "three")
str_detect(df$Strings, "four")

However, I'm stuck here, as I'm trying to build this table:

Homes  Quantity Percent
  One         2     0.2
  Two         4     0.4
Three         1     0.1
 Four         3     0.3
Total        10       1

Answer 1:


With tidyverse and janitor you can do:

library(tidyverse)  # dplyr verbs and stringr::str_extract
library(janitor)    # adorn_totals

df %>%
 mutate(Homes = str_extract(Strings, "one|two|three|four"),
        n = n()) %>%
 group_by(Homes) %>%
 summarise(Quantity = length(Homes),
           Percent = first(length(Homes)/n)) %>%
 adorn_totals("row")

 Homes Quantity Percent
  four        3     0.3
   one        2     0.2
 three        1     0.1
   two        4     0.4
 Total       10     1.0

Or with just tidyverse:

df %>%
 mutate(Homes = str_extract(Strings, "one|two|three|four"),
        n = n()) %>%
 group_by(Homes) %>%
 summarise(Quantity = length(Homes),
           Percent = first(length(Homes)/n)) %>%
 rbind(., data.frame(Homes = "Total", Quantity = sum(.$Quantity), 
                     Percent = sum(.$Percent)))

In both cases the code first extracts the matching word from each string and stores the total number of rows. Second, it groups by the matched word. Third, it computes the number of cases per word and that word's share of all rows. Finally, it adds a "Total" row.
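The extract, count and proportion steps described above can also be sketched more compactly with dplyr's count(). This is only an alternative sketch, not the answerer's code; it assumes dplyr and stringr are installed, the name res is made up for illustration, and it leaves out the "Total" row.

```r
library(dplyr)
library(stringr)

df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

res <- df %>%
  # pull out whichever of the four words occurs in each string
  mutate(Homes = str_extract(Strings, "one|two|three|four")) %>%
  # one row per word, with the number of strings containing it
  count(Homes, name = "Quantity") %>%
  # share of each word among all rows
  mutate(Percent = Quantity / sum(Quantity))

print(res)
```

Dividing by sum(Quantity) after counting avoids carrying the helper column n through the pipeline.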




Answer 2:


You can use str_extract and then pass the result to table and prop.table, i.e.

library(stringr)

str_extract(df$Strings, 'one|two|three|four')
#[1] "one"   "one"   "two"   "three" "two"   "four"  "two"   "four"  "four"  "two"  

table(str_extract(df$Strings, 'one|two|three|four'))
# four   one three   two 
#    3     2     1     4 

prop.table(table(str_extract(df$Strings, 'one|two|three|four')))
# four   one three   two 
#  0.3   0.2   0.1   0.4 



Answer 3:


A base R option would be regmatches/regexpr with table:

table(regmatches(df$Strings, regexpr('one|two|three|four', df$Strings)))
#  four   one three   two 
#    3     2     1     4 

Adding addmargins appends the sum, and dividing every entry by that sum gives the proportions:

out <- addmargins(table(regmatches(df$Strings, 
     regexpr('one|two|three|four', df$Strings))))
out/out[length(out)]

# four   one three   two   Sum 
#  0.3   0.2   0.1   0.4   1.0 
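To get both the counts and the proportions into one table shaped like the one in the question, the same base R pieces can be combined into a data frame. This is a minimal sketch rather than part of the original answer: the names words, pattern, matched, counts and result are made up for illustration, and it relies on the question's guarantee that every string contains exactly one of the words (regmatches silently drops non-matching strings).

```r
df <- data.frame(Strings = c("ñlas onepojasd", "onenañdsl", "ñelrtwofkld",
                             "asdthreeasp", "asdfetwoasd", "fouroqwke",
                             "okasdtwo", "acmofour", "porefour", "okstwo"))

words   <- c("one", "two", "three", "four")
pattern <- paste(words, collapse = "|")

# first matching word in each string
matched <- regmatches(df$Strings, regexpr(pattern, df$Strings))

# a factor with explicit levels keeps the order one, two, three, four
counts <- table(factor(matched, levels = words))

result <- data.frame(
  Homes    = c(names(counts), "Total"),
  Quantity = c(as.vector(counts), sum(counts)),
  Percent  = c(as.vector(counts) / sum(counts), 1)
)

print(result)
```

Setting the factor levels explicitly is a design choice: without it, table sorts the words alphabetically, as in the outputs above.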


Source: https://stackoverflow.com/questions/54787651/summarize-from-string-matches
