Question
Original title: Keep newline character in string during gsub
In an earlier post, I tried to convert JSON to markdown unordered lists. It is almost done, but there is one pattern I cannot handle: if a title contains a space, hyphen, space sequence, the hyphen is treated as a list-item marker. If I try to avoid this by referring to a newline character in the pattern, nothing works as I expect.
Input JSON: https://gist.github.com/hermanp/381eaf9f2bf5f2b9cdf22f5295e73eb5
Preferred markdown output (two-space indentation):
- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source
To generate the markdown, I use the following two functions:
generate_md()
library(jsonlite)
generate_md <- function (jsonfile) {
  bmarks_json_lite <- fromJSON(txt = jsonfile)
  level1 <- bmarks_json_lite$children$children[[2]]
  markdown_result <- recursive_func(level = level1)
  return(markdown_result)
}
recursive_func()
recursive_func <- function (level) {
  md_result <- character()
  for (i in seq_len(nrow(level))) {
    if (level[i, "type"] == "text/x-moz-place") {
      # A plain bookmark: emit one list item.
      md_title <- paste0("- ", level[i, "title"], "\n")
    } else if (level[i, "type"] == "text/x-moz-place-container") {
      # A folder: emit its title, then recurse into its children.
      md_title <- paste0("- ", level[i, "title"], "\n")
      md_recurs <- recursive_func(level = level[i, "children"][[1]])
      # >>>>> This is the problematic part. <<<<<
      # Indents every hyphen followed by a space, including hyphens inside titles.
      md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = TRUE)
      md_title <- paste0(md_title, md_recurs)
    }
    md_result <- paste0(md_result, md_title)
  }
  return(md_result)
}
With these functions I get the output below (note the unnecessary spaces in the JavaScript - Wikipedia entry). I want `- JavaScript - Wikipedia`, but the hyphen inside the title is indented as if it were a list marker, so extra spaces appear before it. I hope this example represents the different scenarios with hyphens and indentation; it is only a fraction of my bookmarks, since I wanted to give a minimal example.
cat(generate_md(paste0("https://gist.githubusercontent.com/hermanp/",
                       "381eaf9f2bf5f2b9cdf22f5295e73eb5/raw/",
                       "76b74b2c3b5e34c2410e99a3f1b6ef06977b2ec7/",
                       "bookmarks-example-hyphen.json")))
# Output
- Info
  - Python
    - The Ultimate Python Beginner's Handbook
    - Python Like You Mean It
    - Automate the Boring Stuff with Python
    - Data science Python notebooks
  - Frontend
    - CodePen
    - JavaScript     - Wikipedia
    - CSS-Tricks
    - Butterick’s Practical Typography
    - Front-end Developer Handbook 2019
    - Using Ethics In Web Design
    - Client-Side Web Development
  - Stack Overflow
  - HUP
  - Hope in Source
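The effect can be reproduced without the JSON file. The following is a minimal sketch on a made-up two-line string (the input is hypothetical, not taken from the gist), assuming the replacement is two spaces plus a hyphen, in line with the two-space indentation described above. Applying the substitution twice mimics two levels of recursion.

```r
# Hypothetical input: two list items, one of which has " - " inside its title.
md <- "- CodePen\n- JavaScript - Wikipedia\n"

# Same pattern as in recursive_func(): every hyphen followed by a space
# gets two extra spaces in front of it, whether or not it starts a line.
once  <- gsub("-(?= )", "  -", md, perl = TRUE)
twice <- gsub("-(?= )", "  -", once, perl = TRUE)

cat(twice)
```

Each pass widens the gap before the hyphen inside the title by two spaces, which is why the entry drifts further out at deeper nesting levels.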
I modified the gsub() call in recursive_func() as shown below, without getting the desired output:
md_recurs <- gsub("-(?= )", "  -", md_recurs, perl = T) # Original
md_recurs <- gsub("(\n)?-(?= )", "  -", md_recurs, perl = T) # No newlines
md_recurs <- gsub("(-)(?= )(?<=\n)?", "  -", md_recurs, perl = T) # Same as Original
Searching Google for `regex newline before char gsub site:stackoverflow.com`, I found no answer or hint for this question. I also played with regex101.com, but could not find the right path.
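To illustrate why the second attempt loses its newlines, here is a small sketch on a made-up string. The optional group `(\n)?` consumes the newline as part of the match, and the replacement does not put it back. The second call is one possible workaround and is not part of the original post: it captures the line start (or newline) and any existing indentation, then re-emits both through backreferences.

```r
# Hypothetical input with a nested item and a hyphen inside a title.
x <- "- A\n- B\n"
y <- "- A\n  - B - C\n"

# The newline is matched by (\n)? and replaced along with the hyphen.
eaten <- gsub("(\n)?-(?= )", "  -", x, perl = TRUE)

# Assumption-based alternative: keep the newline (\\1) and the existing
# indentation (\\2), and only touch hyphens that begin a line.
kept <- gsub("(^|\n)( *)-(?= )", "\\1\\2  -", y, perl = TRUE)

cat(eaten)
cat(kept)
```

In this sketch the hyphen inside `B - C` is left alone because it is not preceded by a line start followed only by spaces.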
Answer 1:
You can use
gsub("\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)", "  -", x, perl=TRUE)
Details:
- `\w` - a word char
- `\h+` - one or more horizontal whitespace chars
- `-` - a literal hyphen
- `\h` - a horizontal whitespace
- `(*SKIP)(*F)` - omit the text matched so far, fail the match, and start searching again from the location where it failed
- `|` - or
- `-` - a literal hyphen
- `(?=\h)` - that is immediately followed by a horizontal whitespace.
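As a quick check of the pattern above, here is a minimal sketch on a made-up string (the input is hypothetical, and the replacement is assumed to be two spaces plus a hyphen). Since `\h` does not match a newline, a list marker at the start of a line is never part of the skipped branch.

```r
# Hypothetical input: the second title contains " - ".
x <- "- CodePen\n- JavaScript - Wikipedia\n"

# First alternative: word char, spaces, hyphen, space -> skipped and failed.
# Second alternative: any remaining hyphen followed by horizontal whitespace.
pattern <- "\\w\\h+-\\h(*SKIP)(*F)|-(?=\\h)"

indented <- gsub(pattern, "  -", x, perl = TRUE)
cat(indented)
```

Only the two list markers are indented; the hyphen between `JavaScript` and `Wikipedia` is skipped because it is preceded by a word character and a space.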
Answer 2:
After thinking over the problem and the structure of the string, and reading about lookbehind, I finally came up with a solution.
The `md_recurs` line needs to be modified as follows:
md_recurs <- gsub("(?<!(\\w ))-(?= )", "  -", md_recurs, perl = T)
That is, the `pattern` argument of gsub() becomes:
(?<!(\\w ))-(?= )
Which means:
- replace a hyphen `-` with two spaces and a hyphen (`  -`)
- if it is not preceded by a word character and a space, `(?<!(\\w ))`, and
- if it is followed by a space, `(?= )`.
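The lookbehind version can be tried on a small made-up string as well. This is only a sketch: the inputs are hypothetical and the replacement is assumed to be two spaces plus a hyphen. The last call shows a case the pattern does not cover, a title where the character before the space is not a word character.

```r
pattern <- "(?<!(\\w ))-(?= )"

# Hypothetical input: list markers are indented, the title hyphen is not.
x <- "- CodePen\n- JavaScript - Wikipedia\n"
indented <- gsub(pattern, "  -", x, perl = TRUE)
cat(indented)

# Hypothetical edge case: "+" is not a word character, so the lookbehind
# does not protect the hyphen inside this title.
edge <- gsub(pattern, "  -", "- C++ - Wikipedia\n", perl = TRUE)
cat(edge)
```

For bookmark titles where the hyphen always follows a letter, digit, or underscore plus a space, the pattern behaves as described; other titles may need a stricter rule.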
Source: https://stackoverflow.com/questions/65143495/keep-newline-character-and-selectively-indent-in-string-during-gsub