GNU sed: remove portion of line after pattern match with special characters

Question

The goal is to use sed to return only the URL from each line of the Firefox (FF) extension Mining Blocker, which uses this format for its regex lines:

{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},

The result should be:

002.0x1f4b0.com
003.0x1f4b0.com

One way would be to keep everything after suburl":"*://*/ and then remove each occurrence of /*"},

I found https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern but the special characters are a problem.

This won't work:

sed -n -e s@^.*suburl":"*://*/@@g hosts

Would someone please show me how to mark the 2 asterisks in the string so they are seen by regex as literal characters, not wildcards?

Edit:

sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' hosts

doesn't work, unfortunately.

Regarding character substitution, thanks for directing me to the references.

I reduced the searched-for string to //*/ and used ASCII character codes like this:

sed -n -e s@^.*\d047\d047\d042\d047@@g hosts

Unfortunately, that didn't output any changes to the lines.

My assumptions are:

^.*something specifies everything up to and including the last occurrence of "something" in a line

sed -n -e s@search@@g deletes (replace with nothing) "search" within a line

So, this line:

sed -n -e s@^.*\d047\d047\d042\d047@@g hosts

Should output everything after //*/ in each line...except it doesn't.

What is incorrect with that line?

Regarding deleting everything including and after the first / AFTER that first operation, yes, that's wanted too.

Answer 1:

This might work for you (GNU sed):

sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' file

Greedily match (that is, take the longest string that matches) all characters up to and including ://*/, then capture a group of one or more characters other than / (referred to as \1), then match the rest of the line, and replace the whole match with the group \1.
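As a quick check, the command can be run against the two sample lines from the question. This is only a sketch: the temporary file and the variable names hosts and result are made up for the illustration, and it assumes GNU sed (the \+ operator is a GNU extension).

```shell
# Sample input copied from the question, written to a temporary file
# (the file and the variable names are placeholders for this sketch).
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
EOF

# Single quotes keep the shell from removing the backslashes before sed sees them.
result=$(sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' "$hosts")

printf '%s\n' "$result"
# 002.0x1f4b0.com
# 003.0x1f4b0.com

rm -f "$hosts"
```

Only the suburl part of each line contains the literal text ://*/, so the greedy .* stops there and the captured group is the host name.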

N.B. the sed substitution delimiters are arbitrary; here # is chosen so as to make matching / easier. Also, the character * on the left-hand side of the substitution command is normally interpreted as a metacharacter meaning zero or more of the previous character/group, so it is quoted as \* to make it match a literal asterisk. Finally, the option -n turns off the usual printing of everything in the pattern space after all the sed commands have been executed. The p flag on the substitution command prints the pattern space following a successful substitution, so only the URLs appear in the output, or nothing at all for lines that do not match.
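The points in this note can be seen in isolation with a small experiment. This is a sketch assuming GNU sed; the variable names line, silent, printed and unescaped are invented for the illustration, and the input is a shortened piece of one sample line.

```shell
# A shortened piece of one sample line from the question.
line='"suburl":"*://*/002.0x1f4b0.com/*"},'

# -n without the p flag: the substitution still runs, but nothing is printed.
silent=$(printf '%s\n' "$line" | sed -n 's#.*://\*/##')

# -n with the p flag: the modified line is printed because the substitution succeeded.
printed=$(printf '%s\n' "$line" | sed -n 's#.*://\*/##p')

# Unescaped *: '/*' means "zero or more slashes", so the pattern matches
# only up to '://' and the literal '*/' is left behind in the output.
unescaped=$(printf '%s\n' "$line" | sed -n 's#.*://*/##p')

printf 'silent=[%s]\nprinted=[%s]\nunescaped=[%s]\n' "$silent" "$printed" "$unescaped"
# silent=[]
# printed=[002.0x1f4b0.com/*"},]
# unescaped=[*/002.0x1f4b0.com/*"},]
```

Quoting the whole sed script in single quotes matters as well: in an unquoted script the shell removes the backslashes before sed ever sees them.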

Source: https://stackoverflow.com/questions/51490729/gnu-sed-remove-portion-of-line-after-pattern-match-with-special-characters

Tags

sed

special-characters