Question
In Python 3.8, I am trying to extract the middle part of a few URLs like the ones below:
s1 = "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html"
s2 = "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html"
s3 = "https://www.rocheplus.es/formacion/lung-link/estadios-tempranos.protected.html#cpnm-cmd"
s4 = "https://www.rocheplus.es/content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors.pdf"
s5 = "https://www.rocheplus.es/content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA.jpg"
I would like to end up with the middle part only, e.g.
congresos-eventos/congresos-internacionales/oncologia/2020/astro
products/oncologia/erivedge/evidencias-cientificas
formacion/lung-link/estadios-tempranos
content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors
content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA
I tried the following code:
import re
x = re.search(r'^https://www.rocheplus.es/(.*)(\.cugod)*(\.protected)*(\..)*', s2)
x.group(1)
but it does not deliver the desired result, and I cannot see what I am missing.
Answer 1:
You'd want the regex to focus on what you do not want to match. I've fixed the regex for you; as @Thomas kindly suggested, you also want the first word after the last '/', up to the first dot.
r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+).*$'
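Basically, the expression [^not] matches any single character that is not 'n', 'o', or 't', so [^\.]+ matches a run of characters up to (but not including) the next dot. A '.' outside a character class must be escaped to match a literal dot; escaping '/' is harmless in Python but not required.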
So,
s1 = "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html"
x = re.search(r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+).*$', s1)
x.group(1) # is equal to 'congresos-eventos/congresos-internacionales/oncologia/2020/astro'
UPDATE:
If you also want to cover the case where you have no folders, use the '|' character and cover that case separately:
pattern = r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+|[^\.\/]+).*$'
x = re.search(pattern, "https://www.rocheplus.es/informacion-cientifica-roche.html")
x.group(1) # is equal to 'informacion-cientifica-roche'
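To check the updated pattern against several URLs at once, a small sketch like the following may help. The helper name extract_middle and the urls list are only illustrative assumptions; the pattern itself is the one from the update above.

```python
import re

# Pattern from the update above: either "folders/last-segment-up-to-first-dot"
# or, when there are no folders, the single segment up to the first dot.
PATTERN = re.compile(r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+|[^\.\/]+).*$')


def extract_middle(url):
    """Return the captured middle part, or None if the URL does not match.

    extract_middle is a hypothetical helper name used for illustration.
    """
    match = PATTERN.search(url)
    return match.group(1) if match else None


urls = [
    "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html",
    "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html",
    "https://www.rocheplus.es/formacion/lung-link/estadios-tempranos.protected.html#cpnm-cmd",
    "https://www.rocheplus.es/informacion-cientifica-roche.html",
]

for url in urls:
    print(extract_middle(url))
```

Returning None for a non-matching URL avoids the AttributeError that x.group(1) would raise when re.search finds nothing.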
Answer 2:
Your initial approach was quite good, just too complicated and failing because of a few details with how regex works.
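To see where the original attempt goes wrong, it can help to print every group it captures. This is only a quick diagnostic sketch using the pattern and s2 from the question:

```python
import re

s2 = "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html"

# The pattern from the question, unchanged.
original = r'^https://www.rocheplus.es/(.*)(\.cugod)*(\.protected)*(\..)*'
x = re.search(original, s2)

# (.*) is greedy, so it swallows the rest of the line, suffixes included.
print(x.group(1))

# The three trailing groups are all optional (*), so they are allowed to
# match nothing, and the regex engine never needs to backtrack into (.*).
print(x.group(2), x.group(3), x.group(4))
```

The first print shows the whole remainder of the URL, and the other three groups come back as None because they matched zero times.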
First, I recommend using a tool like https://regex101.com/ with the flavor set to Python, so you can try out your regex interactively.
Then try to come up with the requirements for your problem:
- You want to match everything after https://www.rocheplus.es/, so your regex has to start with something like https://www\.rocheplus\.es/. Don't forget to escape the dots. If your regex should only match when a line starts with this, use ^https://www\.rocheplus\.es/ (as you already did).
- You want to end the match as soon as a . has occurred. I'm not too sure about this, but it seems that would be easier than matching for protected, right? So your regex has to end with the term \., which matches a literal dot.
- Now for the matching, I started by using https://www\.rocheplus\.es/(.*)\. in the regex viewer. I basically combined the two parts and inserted (.*) to capture everything in between. That didn't work, because it matches everything up to the last dot instead of the first. That happens because .* is greedy by default, meaning it matches as much as possible. You can change this behaviour by appending a ? after the asterisk, so your complete regex will look like this:
https://www\.rocheplus\.es/(.*?)\. (see https://regex101.com/r/JnVsPP/1)
This should hopefully give you a good start to refine your regex from there.
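As a rough sketch of the greedy versus lazy difference, the two variants can be run side by side on the five URLs from the question. The names greedy, lazy, and urls are just illustrative, and the lazy version assumes that no folder name in the path contains a dot, which holds for these examples.

```python
import re

urls = [
    "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html",
    "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html",
    "https://www.rocheplus.es/formacion/lung-link/estadios-tempranos.protected.html#cpnm-cmd",
    "https://www.rocheplus.es/content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors.pdf",
    "https://www.rocheplus.es/content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA.jpg",
]

# Greedy: (.*) runs to the LAST dot in the URL.
greedy = re.compile(r'https://www\.rocheplus\.es/(.*)\.')
# Lazy: (.*?) stops at the FIRST dot after the host name.
lazy = re.compile(r'https://www\.rocheplus\.es/(.*?)\.')

greedy_results = [greedy.search(url).group(1) for url in urls]
lazy_results = [lazy.search(url).group(1) for url in urls]

for g, l in zip(greedy_results, lazy_results):
    print("greedy:", g)
    print("lazy:  ", l)
```

For URLs with more than one suffix, the greedy result keeps everything before the final extension (for example "...astro.protected"), while the lazy result ends at the first dot.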
Source: https://stackoverflow.com/questions/66028013/match-regular-expression-containing-multiple-dots