Question
This is quite annoying, but it should be a rather simple task. Following this guide, I wrote this:
#!/bin/bash
content=$(wget "https://example.com/" -O -)
ampersand=$(echo '\&')
xmllint --html --xpath '//*[@id="table"]/tbody' - <<<"$content" 2>/dev/null |
xmlstarlet sel -t \
-m "/tbody/tr/td" \
-o "https://example.com" \
-v "a//@href" \
-o "/?A=1" \
-o "$ampersand" \
-o "B=2" -n \
I successfully extract each link from the table and everything gets concatenated correctly. However, instead of the ampersand being reproduced as &, I get this at the end of each link:
https://example.com/hello-world/?A=1\&amp;B=2
But actually, I was looking for something like:
https://example.com/hello-world/?A=1&B=2
The idea was to escape the character with a backslash (\&) so that it gets ignored. Initially, I tried placing it directly into -o "\&" instead of -o "$ampersand", and removing the ampersand=$(echo '\&') line. Still the same result.
Essentially, if I remove the backslash, it still outputs:
https://example.com/hello-world/?A=1&amp;B=2
The only difference is that the \ before the &amp; is gone.
Why?
I'm sure it is something basic that is missing.
Answer 1:
Sorry, I can't reproduce your result, but why not use a substitution? Just filter your results through
sed 's/\\&amp;/\&/g'
by adding it to your pipe. It should replace every \&amp; with &.
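One detail that is easy to trip over with this approach: in a sed replacement, a bare & stands for the whole matched text, so the ampersand has to be written as \& to come out literally. A minimal sketch, assuming a made-up sample line in place of the real xmlstarlet output (the variable names are only for illustration):

```shell
#!/bin/bash
# Hypothetical sample line, standing in for one line of xmlstarlet output.
escaped='https://example.com/hello-world/?A=1&amp;B=2'

# A bare & in the replacement means "the whole match", so nothing changes.
unchanged=$(printf '%s\n' "$escaped" | sed 's/&amp;/&/g')

# \& in the replacement is a literal ampersand.
fixed=$(printf '%s\n' "$escaped" | sed 's/&amp;/\&/g')

printf '%s\n' "$unchanged" "$fixed"
```

The first line printed is identical to the input, and only the second one has the entity turned into a plain &.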
Answer 2:
&amp; is the correct way to print & in an XML document, but since you just want a plain URL, your output should not be XML. Therefore you need to switch to text mode by passing --text or -T to the sel command.
Your example input doesn't quite work because example.com doesn't have any table elements, but here is a working example building links from p elements instead.
content=$(wget 'https://example.com/' -O -)
xmlstarlet fo --html <<<"$content" |
xmlstarlet sel -T -t \
-m '//p[a]' \
--if 'not(starts-with(a//@href,"http"))' \
-o 'https://example.com/' \
--break \
-v 'a//@href' \
-o '/?A=1' \
-o '&' \
-o 'B=2' -n
The output is
http://www.iana.org/domains/example/?A=1&B=2
Answer 3:
As you have already seen, backslash-escaping isn't the solution here. I can think of two possible options:
Extract the hrefs (you probably don't need both xmllint and xmlstarlet to do this), then just use a standard text processing tool such as sed to add the start and the end:
sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,'
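To see what that sed command does in isolation, here is a small sketch. It assumes the hrefs have already been extracted one per line; the two sample values and the variable names are invented for the demo and are not taken from the real page.

```shell
#!/bin/bash
# Hypothetical hrefs, one per line, as an extraction step might print them.
hrefs=$(printf '%s\n' 'hello-world' 'foo-bar')

# Prepend the base URL and append the query string.
# Commas are used as the s delimiter so the slashes need no escaping;
# \& in the replacement is a literal ampersand.
links=$(printf '%s\n' "$hrefs" | sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,')

printf '%s\n' "$links"
```

Because the text never passes through an XML serializer at this point, the ampersand is not turned into an entity.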
Alternatively, pipe the output of what you've currently got to xmlstarlet unesc, which will change &amp; into &.
Source: https://stackoverflow.com/questions/46255304/unescape-the-ampersand-via-xmlstarlet-bugging-amp