replacing the nth character in a string only if it is a particular character in R

Submitted by 我只是一个虾纸丫 on 2019-12-24 06:31:23

Question


I am importing a series of surveys as .csv files and combining them into one data set. The problem is that in one of the seven files some of the variables are imported slightly differently. The data set is huge, and I would like to write a function to run over the data set that is giving me trouble.

In some of the variable names there is an underscore where there should be a dot. Not all variables have the same format, but the incorrect ones do: the underscore is always the 6th character of the column name.

I want R to look at the 6th character and, if it is an underscore, replace it with a dot. Here is a made-up example:

col_names <- c("s1.help_needed",
               "s1.Q2_im_stuck",
               "s1.Q2.im_stuck",
               "s1.Q3.regex",
               "s1.Q3_regex",
               "s2.Q1.is_confusing",
               "s2.Q2.answer_please",
               "s2.Q2_answer_please",
               "s2.someone_knows_the answer",
               "s3.appreciate_the_help")

I assume there is a regex answer to this, but I am struggling to find one. Perhaps there is also a tidyr answer?


Answer 1:


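As @thelatemail pointed out, none of your data actually has an underscore in the fifth position, but some have one in the sixth position (where others have a dot). A base R approach would be to use gsub(). Because the pattern is anchored with ^, it can match at most once per string, so sub() would work equally well: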

result <- gsub("^(.{5})_", "\\1.", col_names)

> result
 [1] "s1.help_needed"              "s1.Q2.im_stuck"             
 [3] "s1.Q2.im_stuck"              "s1.Q3.regex"                
 [5] "s1.Q3.regex"                 "s2.Q1.is_confusing"         
 [7] "s2.Q2.answer_please"         "s2.Q2.answer_please"        
 [9] "s2.someone_knows_the answer" "s3.appreciate_the_help"

Here is an explanation of the regex:

^         from the start of the string
(.{5})    match AND capture any five characters
_         followed by an underscore

The quantity in parentheses is called a capture group and can be used in the replacement via \\1. So the regex is saying replace the first six characters with the five characters we captured but use a dot as the sixth character.
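
As a minimal sketch of how the same pattern could be wrapped up for reuse, the position can be made a parameter. The function name replace_nth_char and the data frame survey are hypothetical and not part of the question. The sketch assumes `from` is a single character with no special meaning in a regex (true for an underscore) and that `to` contains no backslashes.

```r
# Hypothetical helper: replace the nth character of each string with `to`,
# but only where that character is currently `from`.
# Assumes `from` is a plain character such as "_" (not a regex metacharacter).
replace_nth_char <- function(x, n, from = "_", to = ".") {
  # For n = 6 this builds the pattern "^(.{5})_"
  pattern <- sprintf("^(.{%d})%s", as.integer(n) - 1L, from)
  # "\\1" puts the captured leading characters back, followed by `to`
  sub(pattern, paste0("\\1", to), x)
}

col_names <- c("s1.help_needed", "s1.Q2_im_stuck", "s1.Q3.regex")
fixed <- replace_nth_char(col_names, 6)
print(fixed)

# Applying it to the column names of a made-up data frame
survey <- data.frame(a = 1, b = 2, c = 3)
names(survey) <- col_names
names(survey) <- replace_nth_char(names(survey), 6)
print(names(survey))
```

Only the second name changes here, because it is the only one with an underscore as its 6th character; underscores elsewhere in a name are left alone.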




Answer 2:


You can use a capture group that matches the first five characters of any sort, followed by an underscore, and replace the match with whatever those five characters were followed by a dot. Since all the examples had the underscore in the 6th position, I'm guessing you were not counting the original dots:

> col_names
 [1] "s1.help_needed"              "s1.Q2_im_stuck"             
 [3] "s1.Q2.im_stuck"              "s1.Q3.regex"                
 [5] "s1.Q3_regex"                 "s2.Q1.is_confusing"         
 [7] "s2.Q2.answer_please"         "s2.Q2_answer_please"        
 [9] "s2.someone_knows_the answer" "s3.appreciate_the_help"     
> sub("^(.....)_", "\\1.", col_names)
 [1] "s1.help_needed"              "s1.Q2.im_stuck"             
 [3] "s1.Q2.im_stuck"              "s1.Q3.regex"                
 [5] "s1.Q3.regex"                 "s2.Q1.is_confusing"         
 [7] "s2.Q2.answer_please"         "s2.Q2.answer_please"        
 [9] "s2.someone_knows_the answer" "s3.appreciate_the_help"

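A dot in the replacement argument is taken literally, so it does not need the doubled-backslash escape (\\.) that a literal dot would need in an R regex pattern argument. The backreference \\1 still has to be written with two backslashes, because R's string parser consumes one of them.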
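
A short sketch of how a dot behaves in the pattern versus the replacement argument, using made-up strings that are not from the question:

```r
x <- c("a.b", "a_b")

# Pattern side: an unescaped "." matches any character,
# while "\\." matches only a literal dot
any_char <- sub("a.b", "X", x)    # both elements match
real_dot <- sub("a\\.b", "X", x)  # only "a.b" matches

# Replacement side: "." is inserted literally, no escape needed
with_dot <- sub("_", ".", x)

print(any_char)
print(real_dot)
print(with_dot)
```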



Source: https://stackoverflow.com/questions/41971945/replacing-the-nth-character-in-a-string-only-if-it-is-a-particular-character-in
