stringr

Parsing Interview Text

社会主义新天地 submitted on 2021-02-20 04:14:07
Question: I have a text file of a presidential debate. Eventually, I want to parse the text into a data frame where each row is a statement, with one column for the speaker's name and another column for the statement. For example:

    "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

would become:

      name        text
    1 Bob Smith   Hi Steve. How are you doing?
    2 Steve Brown Hi Bob. I'm doing well!

How do I split the statements from the names? I tried splitting on the colon: data
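One possible approach, sketched in R under the assumption that every speaker label is exactly two capitalized words followed by a colon (a real transcript may need a looser pattern). It uses `str_match_all()` from stringr with a lookahead so that each statement ends where the next label begins. The names `debate` and `statements` are illustrative only.

```r
library(stringr)

# Example text taken from the question
debate <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Assumed label shape: "Firstname Lastname:"
label <- "[A-Z][a-z]+ [A-Z][a-z]+"

# Group 1 = speaker, group 2 = statement (lazy, up to the next label or the end)
pattern <- paste0("(", label, "):\\s*(.*?)(?=\\s*", label, ":|$)")

m <- str_match_all(debate, pattern)[[1]]

statements <- data.frame(
  name = m[, 2],
  text = m[, 3],
  stringsAsFactors = FALSE
)

print(statements)
```

Because the pattern relies on the label shape, a sentence that happens to contain two capitalized words followed by a colon would be treated as a new speaker.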

Replace multiple values using a reference table

百般思念 submitted on 2021-02-19 07:34:09
Question: I'm cleaning a database. One of the fields is "country", but the country names in my database do not match the output I need. I thought of using the str_replace function, but I have over 50 countries that need to be fixed, so that is not the most efficient way. I have already prepared a CSV file with the original country input and the output I need, for reference. Here is what I have so far:

    library(stringr)
    library(dplyr)
    library(tidyr)
    library(readxl)
    database1 <- read_excel("database.xlsx")
    database1
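A minimal sketch of a lookup-table replacement in base R, assuming the reference file has two columns. The column names `original` and `output`, the `country` column, and the toy rows below are all hypothetical stand-ins, since the question does not show the real files.

```r
# Hypothetical reference table (in practice this would come from the CSV file)
ref <- data.frame(
  original = c("USA", "U.K.", "Brasil"),
  output   = c("United States", "United Kingdom", "Brazil"),
  stringsAsFactors = FALSE
)

# Hypothetical data with a "country" column
database1 <- data.frame(
  country = c("USA", "Brasil", "France", "U.K."),
  stringsAsFactors = FALSE
)

# Position of each country in the reference table (NA when there is no entry)
idx <- match(database1$country, ref$original)

# Use the mapped name where one exists, otherwise keep the original value
database1$country_clean <- ifelse(is.na(idx), database1$country, ref$output[idx])

print(database1)
```

Matching whole values with `match()` avoids the partial-match and regex-escaping issues that pattern-based replacement can run into (for example the dots in "U.K.").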

R - Replace multiple patterns with multiple ids

雨燕双飞 submitted on 2021-02-17 06:22:08
Question: This was partially tackled in other posts, but unfortunately I could not make it run properly on my side. I have a data frame full of texts, and there are certain words that I want replaced by a unique name. So, looking at the table below, I would want to replace each of the words "Banana", "Apple", and "Tomato" with the word "Fruit" (the word "Fruit" can show up multiple times, that is OK). I also want to replace "Cod", "Pork", and "Beef" with the word "Animals". I have a full Excel file where this mapping was
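A hedged sketch of one way to do this with stringr: `str_replace_all()` accepts a named character vector in which the names are patterns and the values are replacements. The `mapping` data frame and the sample `texts` below are made up for illustration, standing in for the Excel mapping that the question mentions.

```r
library(stringr)

# Hypothetical mapping table: one row per word, with the id that replaces it
mapping <- data.frame(
  word = c("Banana", "Apple", "Tomato", "Cod", "Pork", "Beef"),
  id   = rep(c("Fruit", "Animals"), each = 3),
  stringsAsFactors = FALSE
)

# Names are regex patterns (whole words only), values are the replacements
replacements <- setNames(mapping$id, paste0("\\b", mapping$word, "\\b"))

# Illustrative texts
texts <- c("I ate a Banana and an Apple",
           "Pork and Beef for dinner",
           "No match here")

replaced <- str_replace_all(texts, replacements)
print(replaced)
```

The words here contain no regex metacharacters. If the real mapping does, they would need escaping before being pasted into a pattern.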

Separate string into many columns

元气小坏坏 submitted on 2021-02-16 18:23:06
Question: I'd like to separate each letter or symbol in a string to compose a new data.frame with as many columns as there are letters. I tried the separate function from the tidyr package, but the result is not what I want.

    df <- data.frame(x = c('house', 'mouse'), y = c('count', 'apple'), stringsAsFactors = F)

Unexpected result:

    df[1, ] %>% separate(x, c('A1', 'A2', 'A3', 'A4', 'A5'), sep = '')
        A1   A2   A3   A4   A5     y
    1 <NA> <NA> <NA> <NA> <NA> count

Expected output:

    A1 A2 A3 A4 A5
    h  o  u  s  e
    m  o  u  s  e
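A minimal base R sketch that produces the expected output, assuming every string in the column has the same number of characters (as in the example). It swaps `tidyr::separate` for `strsplit()` with an empty split string, which breaks each value into single characters. `letters_mat` and `result` are illustrative names.

```r
df <- data.frame(x = c('house', 'mouse'),
                 y = c('count', 'apple'),
                 stringsAsFactors = FALSE)

# Split every string into single characters, then stack the pieces row by row
letters_mat <- do.call(rbind, strsplit(df$x, ""))

# Name the columns A1, A2, ... to match the expected output
colnames(letters_mat) <- paste0("A", seq_len(ncol(letters_mat)))

result <- as.data.frame(letters_mat, stringsAsFactors = FALSE)
print(result)
```

If the strings had different lengths, `rbind` would recycle the shorter pieces, so padding would be needed first.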

R remove words from sentences in dataframe

点点圈 submitted on 2021-02-16 14:06:50
Question: I have one data frame with two columns, each containing sentences, and I would like to subtract one from the other. I can't easily find a method to do the following:

    > c1 <- c("A short story","Not so short")
    > c2 <- c("A short", "Not so")
    > data.frame(c1, c2)

which should give the result of c1 - c2: "story","short". Any ideas are helpful.

Answer 1: We can use str_remove, which is vectorized:

    library(stringr)
    library(dplyr)
    df1 %>% mutate(c3 = str_remove_all(c1, c2))
    c1 c2 c3
    #1 A short
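A small self-contained sketch of the same idea without dplyr, using the vectors from the question. One assumption is added here: removing "A short" from "A short story" leaves a leading space, so the result is passed through `str_trim()` to get the output the question asks for.

```r
library(stringr)

c1 <- c("A short story", "Not so short")
c2 <- c("A short", "Not so")

# str_remove_all() is vectorized over both arguments:
# element i of c2 is removed from element i of c1
c3 <- str_trim(str_remove_all(c1, c2))

print(c3)
```

Note that `c2` is interpreted as a regular expression, so values containing characters such as `.` or `(` would need `fixed()` or escaping.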

R remove words from sentences in dataframe

我与影子孤独终老i submitted on 2021-02-16 14:06:09
Question: I have one data frame with two columns, each containing sentences, and I would like to subtract one from the other. I can't easily find a method to do the following:

    > c1 <- c("A short story","Not so short")
    > c2 <- c("A short", "Not so")
    > data.frame(c1, c2)

which should give the result of c1 - c2: "story","short". Any ideas are helpful.

Answer 1: We can use str_remove, which is vectorized:

    library(stringr)
    library(dplyr)
    df1 %>% mutate(c3 = str_remove_all(c1, c2))
    c1 c2 c3
    #1 A short

remove leading 0s with stringr in R

醉酒当歌 submitted on 2021-02-16 10:19:11
Question: I have the following data:

    id
    00001
    00010
    00022
    07432

I would like to remove the leading 0s so the data would look like the following:

    id
    1
    10
    22
    7432

Answer 1: Using the new str_remove function in stringr:

    id = str_remove(id, "^0+")

Answer 2: Here is a base R option using sub:

    id <- sub("^0+", "", id)
    id
    [1] "1" "10" "22" "7432"

Answer 3: We can just convert to numeric:

    as.numeric(df1$id)
    #[1] 1 10 22 7432

If we require character class output, str_replace from stringr can be used:

    library(stringr)
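A short runnable sketch of the `str_remove()` approach on the ids from the question, plus one edge case the answers do not discuss: an id made only of zeros. The lookahead variant is an addition for illustration, not part of the original answers.

```r
library(stringr)

id <- c("00001", "00010", "00022", "07432")

# Remove one or more zeros anchored at the start of each string
stripped <- str_remove(id, "^0+")
print(stripped)

# Edge case: "^0+" would turn "0000" into an empty string.
# Requiring at least one character after the removed zeros keeps a final "0".
all_zero <- str_remove("0000", "^0+(?=.)")
print(all_zero)
```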

Regular Expression R: Select the above or below lines of a regexp selection while meeting another regexp criteria

无人久伴 submitted on 2021-02-11 13:58:28
Question: I am working with a text document similar to the example below.

    File <- c("Location Name Code and Label Frequency Percentage",
              " During the past 30 days, on how many days did you carry a weapon",
              "44-44 Q13 such as a gun, knife, or club on school property?",
              " 1 0 days 1,610 94.5",
              " 2 1 day 71 4.3",
              " 3 2 or 3 days 6 0.4",
              " 4 4 or 5 days 3 0.2",
              " 5 6 or more days 12 0.7",
              " Missing 48",
              "45-45 Q14 During the past 12 months, on how many days did you carry a gun?",
              " 1 0 days 1,602 91.3",
              "
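The question is cut off, so the following is only one reading of its title: find the lines that match a first pattern (a question code such as "44-44 Q13"), then keep the line directly above each match only when it meets a second criterion (here, that it does not start with a digit, meaning it is not an answer row). Both patterns are assumptions based on the visible lines, and only part of the sample vector is reproduced.

```r
library(stringr)

# Subset of the lines shown in the question
File <- c("Location Name Code and Label Frequency Percentage",
          " During the past 30 days, on how many days did you carry a weapon",
          "44-44 Q13 such as a gun, knife, or club on school property?",
          " 1 0 days 1,610 94.5",
          " 2 1 day 71 4.3",
          "45-45 Q14 During the past 12 months, on how many days did you carry a gun?",
          " 1 0 days 1,602 91.3")

# First criterion: lines that start with a location range and a question code
hits <- which(str_detect(File, "^\\d+-\\d+ Q\\d+"))
hits <- hits[hits > 1]          # a match on line 1 has no line above it

# Candidate lines directly above each match
above <- hits - 1

# Second criterion: the line above is not an answer row (does not start with a digit)
is_continuation <- !str_detect(File[above], "^\\s*\\d")

selected <- File[above[is_continuation]]
print(selected)
```

Selecting the line below a match works the same way with `hits + 1`, after dropping any match on the last line.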

Regular Expression R: Select the above or below lines of a regexp selection while meeting another regexp criteria

岁酱吖の submitted on 2021-02-11 13:57:24
Question: I am working with a text document similar to the example below.

    File <- c("Location Name Code and Label Frequency Percentage",
              " During the past 30 days, on how many days did you carry a weapon",
              "44-44 Q13 such as a gun, knife, or club on school property?",
              " 1 0 days 1,610 94.5",
              " 2 1 day 71 4.3",
              " 3 2 or 3 days 6 0.4",
              " 4 4 or 5 days 3 0.2",
              " 5 6 or more days 12 0.7",
              " Missing 48",
              "45-45 Q14 During the past 12 months, on how many days did you carry a gun?",
              " 1 0 days 1,602 91.3",
              "

Tidy Evaluation not working with mutate and stringr

守給你的承諾、 submitted on 2021-02-11 13:43:16
Question: I've been trying to use tidy evaluation and stringr together inside a mutate pipe, but every time I run it, it gives me an undesirable result. Instead of replacing the letter 'a' with the letter 'X', it overwrites the entire vector with the column name, as you can see in the example below, which uses the iris dataset.

    text_col = "Species"
    iris %>% mutate({{text_col}} := str_replace_all({{text_col}}, pattern = "a", replacement = "X"))

Result:

    structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4
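A likely explanation, offered as a sketch and not as the accepted answer: `text_col` holds a string, so `{{text_col}}` on the right-hand side hands the literal "Species" to `str_replace_all()`, and that single value is recycled down the column. When the column name is stored as a string, the `.data` pronoun can look the column up by name. The `as.character()` call is an added precaution because `Species` is a factor.

```r
library(dplyr)
library(stringr)

text_col <- "Species"

result <- iris %>%
  mutate(
    # Left side: !! unquotes the string so it becomes the column name
    # Right side: .data[[text_col]] fetches the column called "Species"
    !!text_col := str_replace_all(as.character(.data[[text_col]]),
                                  pattern = "a", replacement = "X")
  )

print(unique(result$Species))
```

The `{{ }}` operator is intended for function arguments that capture a bare column name, which is why it does not help when the name is already a character string.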