R: removing the last three dots from a string

我在风中等你 2021-01-03 04:49

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. Th

3 Answers
  • 2021-01-03 05:14

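    Reverse the string.
    Reverse the pattern you're searching for if necessary. It isn't in your case, since a run of three dots reads the same in both directions.
    Search the reversed string, so that the first match found is the last occurrence in the original, then reverse the result.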

    [haiku-pseudocode]

    a = 'first string of junk... 0.2 0 1' // string to search
    b = 'junk' // pattern to match 
    
    ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
    rb = reverseString(b) // now equals 'knuj'
    
    // run your regular expression search / replace - search in 'ra' for 'rb'
    // put the result in rResult
    // and then unreverse the result
    // apologies for not knowing the syntax for 'R' regex
    

    [/haiku-pseudocode]
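
    The pseudocode above could be sketched in R roughly as follows. This is only an illustration: `reverse_string` is a hypothetical helper (base R has no built-in string reverse), the sample string is the one from the pseudocode, and the pattern is assumed to be a literal run of three dots.

```r
# Hypothetical helper: reverse each string in a character vector
reverse_string <- function(x) {
  vapply(strsplit(x, NULL),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

a  <- "first string of junk... 0.2 0 1"  # string to search
ra <- reverse_string(a)                  # "1 0 2.0 ...knuj fo gnirts tsrif"

# sub() replaces only the first match; in the reversed string that is
# the last run of three dots in the original
r_result <- sub("\\.{3}", "", ra)

# Un-reverse the result
result <- reverse_string(r_result)       # "first string of junk 0.2 0 1"
```

    Because `sub()` stops after the first match, earlier runs of dots in the original string would be left untouched.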

  • 2021-01-03 05:15

    This does the trick, though not especially elegant...

    options(stringsAsFactors = FALSE)
    
    
    # aa is assumed to be the data frame from the question, with the raw
    # strings in column C1.
    # Match everything up to the last run of three delimiter characters
    # (period, comma, or bullet), skip any whitespace after it, and capture
    # the rest of the line (the parenthesized group, \\1 in the replacement)
    nums <- gsub(pattern = "^.*[.,•]{3}\\s*(.*)", replacement = "\\1", x = aa$C1)
    
    # Use strsplit to break each result apart at spaces, then rbind the
    # pieces into a character matrix with one row per original string
    # (this assumes every string yields the same number of fields)
    num.mat <- do.call(rbind, strsplit(nums, split = " "))
    
    
    # Mash it back together with your original strings
    result <- as.data.frame(cbind(aa, num.mat))
    
    # Give it informative names
    names(result) <- c("original.string", "num1", "num2", "num3")
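
    To see why this regex picks out the last run of delimiters rather than the first, here is a small sketch on made-up strings (the vector `x` is hypothetical, not the data from the question; `\u2022` is the bullet character). The greedy `^.*` consumes as much as it can, so `[.,•]{3}` ends up matching the final run; a lazy `^.*?` would stop at the first one.

```r
# Hypothetical input: each string has an earlier delimiter run inside the junk
x <- c("junk... more junk... 0.2 0 1",
       "other,,, junk...1,234 5 6")

# Greedy ^.* : the three-delimiter class matches the LAST run
greedy <- sub("^.*[.,\u2022]{3}\\s*(.*)", "\\1", x)
greedy
# "0.2 0 1"   "1,234 5 6"

# Lazy ^.*? : the match stops at the FIRST run instead
lazy <- sub("^.*?[.,\u2022]{3}\\s*(.*)", "\\1", x, perl = TRUE)
lazy
# "more junk... 0.2 0 1"   "junk...1,234 5 6"
```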
    
  • 2021-01-03 05:19

    This will get you most of the way there, and it will have no problems with numbers that include commas:

    # First, use a regex to eliminate the bad pattern.  This regex
    # eliminates any three-character combination of periods, commas,
    # and big dots (•), so long as the combination is followed by 
    # 0-2 spaces and then a digit.
    aa.sub <- as.matrix(
      apply(aa, 1, function (x) 
        gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
    
    # Second: it looks as though you want your data split into columns.
    # So this regex splits on spaces that are (a) preceded by a letter, 
    # digit, or space, and (b) followed by a digit.  The result is a 
    # list, each element of which is a list containing the parts of 
    # one of the strings in aa.
    aa.list <- apply(aa.sub, 1, function (x) 
      strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))  
    
    # Drop the second element of aa.list.  In that string there is no
    # space before the first data column, so strsplit() produced three
    # pieces instead of four, which would break the data frame
    # construction below.
    aa.list <- aa.list[-2]
    
    # Make the data frame.
    aa.list <- lapply(aa.list, unlist)  # convert list of lists to list of vectors
    aa.df   <- data.frame(aa.list)      
    aa.df   <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE) 
    

    The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
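
    One possible way to avoid dropping that string, sketched here on made-up data rather than the original `aa`: instead of deleting the delimiter run, replace it (and any spaces after it) with exactly one space, so every string has a space before its first number and the same split applies to all of them. The vector `x` and the names `x.sub`, `x.list`, and `x.df` are hypothetical, and `\u2022` is the bullet character.

```r
# Hypothetical sample strings; the second has no space between the
# delimiters and the first number, like the problem case described above
x <- c("first string of junk... 0.2 0 1",
       "second string of junk,,,1,234 5 6",
       "third string of junk.,. 7 8 9")

# Replace the delimiter run plus up to two spaces with a single space,
# keeping the digit that follows
x.sub <- gsub("[\u2022.,]{3}\\s{0,2}(\\d)", " \\1", x, perl = TRUE)

# Split on spaces preceded by a word character or space and followed
# by a digit (same idea as the strsplit() call above)
x.list <- strsplit(x.sub, "(?<=[\\w\\s])\\s(?=\\d)", perl = TRUE)

# Every string now yields four pieces, so they can be bound row-wise
x.df <- data.frame(do.call(rbind, x.list), stringsAsFactors = FALSE)
```

    This still assumes the junk portion never contains a space followed by a digit; strings that do would need to be handled separately.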
