Correcting strings scattered over multiple lines based on locating a varying code (with headers)

借酒劲吻你 2020-12-22 09:41

I read a .txt file into R as follows: Election_Parties <- readr::read_lines("Election_Parties.txt") The following text is in

1 answer
  • 2020-12-22 10:24

    I turned the whole thing into a tidy and useful format. Have a look:

    First I read in the file:

    lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
    head(lines)
    #> [1] ""                                                                                                        
    #> [2] "ALBANIA"                                                                                                 
    #> [3] "P1-Democratic Alliance Party (Partia Aleanca Democratike [AD])"                                          
    #> [4] "P2-National Unity Party (Partia Uniteti Kombëtar [PUK])"                                                 
    #> [5] "P3-Social Spectrum Parties-Party of National Unity (Partitë e Spektrit Social-Partia e Unitetit Kombëtar"
    #> [6] "[PSHS-PUK])"
    

    I split the raw format into entries by looking for empty lines, which occur just before a new entry:

    entries <- split(lines, cumsum(grepl("^$|^ $", lines)))
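
    To see why this split works, here is a minimal sketch on a made-up vector (the `toy` lines and the header "COUNTRY_B" are assumptions for illustration, not part of the real file). `grepl()` flags the blank lines, and `cumsum()` turns those flags into a group number that increases at every blank line:

    ```r
    # Toy input (made up for illustration): two blocks separated by empty lines
    toy <- c("", "ALBANIA", "P1-Party A", "P2-Party B", "", "COUNTRY_B", "P1-Party C")

    is_blank <- grepl("^$|^ $", toy)  # TRUE for "" or a single space
    group_id <- cumsum(is_blank)      # increases by 1 at every blank line
    toy_entries <- split(toy, group_id)

    str(toy_entries)
    ```

    Each group still starts with its blank line, which is why the empty elements have to be removed before the header is picked.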
    

    Then I loop through every entry and turn it into a tibble:

    library(stringr)
    library(dplyr)
    df <- lapply(entries, function(entry) {
      entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
      header <- entry[1] # first non empty is the header
      entry <- tail(entry, -1)  # remove header from entry
      desc <- str_extract(entry, "^P\\d+-")  # extract the leading code, e.g. "P1-"; NA for continuation lines
    
      for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
        entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
      }
    
      entry <- entry[!is.na(desc)]
      desc <- desc[!is.na(desc)]
    
      # turn into nice format
      df <- tibble::tibble(
        header,
        desc,
        entry
      )
      df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
      return(df)
    }) %>% 
      bind_rows() # turn list into one data.frame
    

    And now we have a really nice data.frame we can easily work with:

    df
    #> # A tibble: 5,525 x 3
    #>    header  desc  entry                                                     
    #>    <chr>   <chr> <chr>                                                     
    #>  1 ALBANIA P1-   Democratic Alliance Party (Partia Aleanca Democratike [AD~
    #>  2 ALBANIA P2-   National Unity Party (Partia Uniteti Kombëtar [PUK])      
    #>  3 ALBANIA P3-   Social Spectrum Parties-Party of National Unity (Partitë ~
    #>  4 ALBANIA P4-   Alliance Party for Solidarity and Welfare (Partia Aleanca~
    #>  5 ALBANIA P5-   Albanian Democratic Union-Alliance for Freedom, Justice a~
    #>  6 ALBANIA P6-   Liberal Democrat Party (Partia Bashkimi Liberal Demokrat ~
    #>  7 ALBANIA P7-   Linking Blerta Albanian Party (Partia Lidhja e Blertë Shq~
    #>  8 ALBANIA P8-   Democratic Movement for Integration (Lëvizja Demokratike ~
    #>  9 ALBANIA P9-   Movement of Human Rights and Freedoms Party (Partia Lëviz~
    #> 10 ALBANIA P10-  Socialist Party of Albania (Partia Socialiste e Shqipëris~
    #> # ... with 5,515 more rows
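
    As a rough illustration of what this format makes easy, the sketch below counts entries per header and filters one header. It runs on a small hand-made tibble with the same three columns rather than on the real data; `toy_df`, "COUNTRY_B" and the shortened party names are assumptions made up for the example.

    ```r
    library(dplyr)

    # Hand-made stand-in for `df`; "COUNTRY_B" and the party names are hypothetical
    toy_df <- tibble::tibble(
      header = c("ALBANIA", "ALBANIA", "COUNTRY_B"),
      desc   = c("P1-", "P2-", "P1-"),
      entry  = c("Party A", "Party B", "Party C")
    )

    # Number of coded entries per header
    per_header <- count(toy_df, header)

    # All entries that belong to one header
    albania <- filter(toy_df, header == "ALBANIA")

    per_header
    albania
    ```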
    

    The strings which are scattered over multiple lines are corrected in this bit:

      for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
        entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
      }
    

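    desc is NA whenever a line does not begin with a code such as "P1-" (the 1 can be any number). Such a line is a continuation of the previous one, so it is pasted onto the previous line. Afterwards the NA lines are removed, which leaves the full text only on the line that carries the code. As written, the loop assumes an entry spans at most two lines: a second continuation line would be pasted onto the first continuation line, which is then dropped.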
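
    If some entries wrap over more than two lines, one possible variant is to group lines instead of pasting them pairwise. This is only a sketch under assumptions: the `entry` vector below is made up, and `is_start` (base `grepl()`) plays the role of `!is.na(desc)` from the code above.

    ```r
    # Hypothetical entry lines: the P3 name is wrapped over three lines
    entry <- c("P1-Party A", "P3-A very long", "party name", "[ABC])", "P4-Party D")

    # TRUE where a line starts with a code such as "P1-"
    is_start <- grepl("^P\\d+-", entry)

    # cumsum() gives every line the index of the most recent coded line,
    # so any number of continuation lines land in the same group
    group <- cumsum(is_start)

    # Paste the lines of each group into a single string
    collapsed <- unname(vapply(split(entry, group), paste, character(1), collapse = " "))

    collapsed
    ```

    A continuation line that appears before the first coded line would end up in a group of its own (group 0), so that case would still need separate handling.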
