Keep newline character and selectively indent in string during gsub

Submitted by 允我心安 on 2020-12-15 04:59:06

Question


Original title: Keep newline character in string during gsub

There is a post where I try to convert JSON to Markdown unordered lists. It is almost done, but there is one pattern I cannot handle. If a title contains a space, hyphen, space sequence, the hyphen is treated as a list item marker and gets indented. When I try to avoid this by referring to the newline character in the pattern, nothing works as I expect.

Input JSON: https://gist.github.com/hermanp/381eaf9f2bf5f2b9cdf22f5295e73eb5
Preferred Markdown output (two-space indentation):

- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source

To generate the markdown, I use the following two scripts:
generate_md()

library(jsonlite)

generate_md <- function (jsonfile) {
  bmarks_json_lite <- fromJSON(txt = jsonfile)
  level1 <- bmarks_json_lite$children$children[[2]]
  markdown_result <- recursive_func(level = level1)
  return(markdown_result)
}

recursive_func()

recursive_func <- function (level) {
  md_result <- character()
  
  for (i in seq_len(nrow(level))) {
    if (level[i, "type"] == "text/x-moz-place"){
      md_title <- paste0("- ", level[i, "title"], "\n")
    } else if (level[i, "type"] == "text/x-moz-place-container") {
      md_title <- paste0("- ", level[i, "title"], "\n")
      md_recurs <- recursive_func(level = level[i, "children"][[1]])
      
      # >>>>> This is the problematic part. <<<<<
      md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = T)
      md_title <- paste0(md_title, md_recurs)
    }
    
    md_result <- paste0(md_result, md_title)
  }
  
  return(md_result)
}

With these functions I get the output below (note the unnecessary spaces in the "JavaScript - Wikipedia" entry). I want to get "- JavaScript - Wikipedia" instead of "- JavaScript     - Wikipedia". I hope this example covers the different scenarios with hyphens and indentation; it is only a fraction of my bookmarks, because I wanted to give a minimal example.

cat(generate_md(paste0("https://gist.githubusercontent.com/hermanp/",
                       "381eaf9f2bf5f2b9cdf22f5295e73eb5/raw/",
                       "76b74b2c3b5e34c2410e99a3f1b6ef06977b2ec7/",
                       "bookmarks-example-hyphen.json")))
# Output
- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript     - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source
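
The stray spaces can be reproduced without the JSON file. The following is a minimal sketch: the string x is a made-up stand-in for md_recurs, not data taken from the gist.

```r
# Hypothetical stand-in for md_recurs: one title contains " - "
x <- "- CodePen\n- JavaScript - Wikipedia\n"

# The original pattern indents every hyphen that is followed by a space,
# including the hyphen inside the title
indented <- gsub("-(?= )", "  -", x, perl = TRUE)
cat(indented)
```

Each level of recursion applies the substitution again, so the hyphen inside the title gains two more spaces per nesting level.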

I modified the gsub function part in recursive_func as seen below, without the desired output:

md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = T)  # Original
md_recurs <- gsub("(\n)?-(?= )", "  -", md_recurs, perl = T)  # No newlines
md_recurs <- gsub("(-)(?= )(?<=\n)?", "  -", md_recurs, perl = T)  # Same as Original

Searching Google for "regex newline before char gsub site:stackoverflow.com", I found no answer or hint for this question. I also experimented on regex101.com, but could not find the right approach.


Answer 1:


You can use

gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", "  -", x, perl=TRUE)

Details:

  • \w - a word char
  • \h+ - one or more horizontal whitespace chars
  • - - a - char
  • \h - a horizontal whitespace char
  • (*SKIP)(*F) - discard the text matched so far, fail the match, and resume searching from the position where the failure occurred
  • | - or
  • - - a - char
  • (?=\h) - that is immediately followed by a horizontal whitespace char.
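
As a quick check of the pattern above, here is a minimal sketch on a made-up input. The string x is a hypothetical stand-in for md_recurs and is not taken from the gist.

```r
# Hypothetical stand-in for md_recurs
x <- "- CodePen\n- JavaScript - Wikipedia\n- CSS-Tricks\n"

# First alternative: word char, spaces, hyphen, space -> skipped and failed.
# Second alternative: any remaining hyphen followed by horizontal whitespace.
fixed <- gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", "  -", x, perl = TRUE)
cat(fixed)
```

Only the list-marker hyphens are indented. The hyphen in "JavaScript - Wikipedia" is consumed by the skipped alternative, and the one in "CSS-Tricks" is not followed by whitespace.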



Answer 2:


After thinking over the problem and the structure of the string, and reading about lookbehind, I finally came up with a solution.

The md_recurs line needs to be changed to:

md_recurs <- gsub("(?<!(\\w ))-(?= )", "  -", md_recurs, perl = T)

Which means the gsub() pattern parameter had to be modified to:

(?<!(\\w ))-(?= )

Which means:

  • replace a hyphen - (with two spaces and a hyphen)
  • if it is not preceded by a word character and a space, (?<!(\\w )), and
  • if it is followed by a space, (?= ).
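
The lookbehind version can be checked the same way. This is a minimal sketch: x and y are made-up strings standing in for md_recurs, and the second one illustrates a limitation that follows from requiring a word character before the space.

```r
# Hypothetical stand-in for md_recurs
x <- "- CodePen\n- JavaScript - Wikipedia\n- CSS-Tricks\n"

pattern <- "(?<!(\\w ))-(?= )"
fixed <- gsub(pattern, "  -", x, perl = TRUE)
cat(fixed)

# Caveat: a title whose text before " - " ends in a non-word character
# (here "+") is not protected by the lookbehind and is still indented
y <- "- C++ - Wikipedia\n"
still_indented <- gsub(pattern, "  -", y, perl = TRUE)
cat(still_indented)
```

The pattern from Answer 1 also requires a word character before the whitespace, so it appears to share this limitation.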


Source: https://stackoverflow.com/questions/65143495/keep-newline-character-and-selectively-indent-in-string-during-gsub
