There's valid JSON in a JavaScript block on an HTML page that I want to parse with a shell script.
First of all, I would like to get the entire JSON string from {
to
Usually it is not recommended to use Unix command-line tools for parsing HTML. But if you know your marker string, foo.bar.Processor.message, then you may use this sed + jq solution:
sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p' file.html |
jq -r '.head.url | split(";")[1] | split("=")[1]'
347EDAFA2B136D7825745B0A490DE32
In the absence of jq, you may use this sed + GNU grep solution:
sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p' file.html |
grep -oP ';barid=\K\w+'
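To try the extraction without the real page, here is a minimal sketch that runs the same sed expression against a made-up sample. The sample HTML, the example.com URL, and the variable names are assumptions for illustration only; the actual page from the question is not shown. The second step swaps grep -P for plain sed so that it also runs where GNU grep is not available.

```shell
#!/bin/sh
# Hypothetical stand-in for file.html (the real page is not shown in the question).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body>
<script type="text/javascript">
foo.bar.Processor.message({"head":{"url":"http://example.com/path;barid=347EDAFA2B136D7825745B0A490DE32"}});
</script>
</body></html>
EOF

# Step 1: keep only the text between "message(" and the first ")".
# In a basic regex "(" is literal, while "\(" ... "\)" is the capture group.
json=$(sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p' "$tmp")

# Step 2: capture the run of word characters that follows ";barid=".
barid=$(printf '%s\n' "$json" | sed -n 's/.*;barid=\([[:alnum:]_]*\).*/\1/p')

printf 'json:  %s\n' "$json"
printf 'barid: %s\n' "$barid"

rm -f "$tmp"
```

With this sample, the first line printed is the bare JSON object and the second is the barid value.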
One option might be to use pup, at least for parsing the HTML:
< input.html pup 'script:not(:empty) text{}' |
grep -F 'foo.bar.Processor.message' | grep -o '{.*}' |
jq -r '.head.url
| split(";")[]
| select(test("barid="))
| sub("barid=";"")'
With your HTML (adjusted to ensure the JSON in the HTML is valid), this produces:
347EDAFA2B136D7825745B0A490DE32
Of course there are many caveats. YMMV.
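One concrete caveat, sketched on a hypothetical input line (the "title" field is made up for illustration): the sed pattern [^)]* stops at the first closing parenthesis, so a ")" inside a JSON string truncates the result, whereas the greedy grep -o '{.*}' runs from the first "{" to the last "}" on the line.

```shell
#!/bin/sh
# Hypothetical line whose JSON contains parentheses inside a string value.
line='foo.bar.Processor.message({"head":{"title":"a (b) c"}});'

# sed variant: [^)]* ends at the first ")", so the JSON is cut short.
from_sed=$(printf '%s\n' "$line" |
  sed -n 's/foo\.bar\.Processor\.message(\([^)]*\).*/\1/p')

# grep variant: greedy match from the first "{" to the last "}" on the line.
from_grep=$(printf '%s\n' "$line" | grep -o '{.*}')

printf 'sed:  %s\n' "$from_sed"    # sed:  {"head":{"title":"a (b
printf 'grep: %s\n' "$from_grep"   # grep: {"head":{"title":"a (b) c"}}
```

The greedy grep has the opposite weakness: if more braces follow the call on the same line, it will capture too much.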