Match regular expression containing multiple dots [duplicate]

Submitted by 我是研究僧i on 2021-02-05 12:18:53

Question


In Python 3.8, I am trying to extract the middle part of a few URLs like the ones below:

s1 = "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html"
s2 = "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html"
s3 = "https://www.rocheplus.es/formacion/lung-link/estadios-tempranos.protected.html#cpnm-cmd"
s4 = "https://www.rocheplus.es/content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors.pdf"
s5 = "https://www.rocheplus.es/content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA.jpg"

I would like to end up with the middle part only, e.g.

congresos-eventos/congresos-internacionales/oncologia/2020/astro
products/oncologia/erivedge/evidencias-cientificas
formacion/lung-link/estadios-tempranos
content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors
content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA

I tried the following code:

import re
x = re.search(r'^https://www.rocheplus.es/(.*)(\.cugod)*(\.protected)*(\..)*', s2)
x.group(1)

but it does not deliver the desired result, and I cannot see what I am missing.


Answer 1:


You'd want the regex to focus on what you do not want. I've fixed the regex for you; as @Thomas kindly suggested, you'd also want the first word after the last '/', up to the dot.

r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+).*$'

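Basically, the expression [^not] will match any character that is not 'n', 'o', or 't', so [^\.]+ matches a run of characters that are not dots. A '.' needs to be escaped outside a character class; inside [...] it is already literal, and in Python '/' does not need escaping at all, although the extra backslashes are harmless.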

So,

s1 = "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html"
x = re.search(r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+).*$', s1)
x.group(1)  # is equal to 'congresos-eventos/congresos-internacionales/oncologia/2020/astro'
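To check the pattern against all five URLs from the question, here is a minimal sketch. It uses the same pattern with the redundant backslashes removed (an equivalent spelling, not a different regex); the names PATTERN, urls and results are only illustrative.

```python
import re

# Same pattern as above; '/' and a '.' inside [...] need no backslash in Python.
PATTERN = re.compile(r'^https://www\.rocheplus\.es/(.*/[^.]+).*$')

urls = [
    "https://www.rocheplus.es/congresos-eventos/congresos-internacionales/oncologia/2020/astro.protected.html",
    "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html",
    "https://www.rocheplus.es/formacion/lung-link/estadios-tempranos.protected.html#cpnm-cmd",
    "https://www.rocheplus.es/content/dam/hcp-portals/spain/documents/areas-terapeuticas/oncologia/lung-link/enf-metastasica/cpm/CPM_IMpower133_characterisation%20of%20long-term%20survivors.pdf",
    "https://www.rocheplus.es/content/dam/hcp-portals/spain/images/formacion/medicos/neopath/17_Macro_de_pieza_tras_TNA.jpg",
]

# The greedy '.*/' runs up to the last slash; '[^.]+' then stops at the first dot.
results = [PATTERN.search(url).group(1) for url in urls]

for middle in results:
    print(middle)
```

Each printed line should be the URL path without the host prefix and without anything from the first dot of the last path segment onwards.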

UPDATE:

If you also want to cover the case where you have no folders, use the '|' character and cover that case separately:

pattern = r'^https:\/\/www\.rocheplus\.es\/(.*\/[^\.]+|[^\.\/]+).*$'

x = re.search(pattern, "https://www.rocheplus.es/informacion-cientifica-roche.html")
x.group(1)  # is equal to 'informacion-cientifica-roche'
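If the pattern is applied to many URLs, it may be convenient to wrap it in a small helper that also copes with URLs that do not match at all. The following is only a sketch: the function name extract_middle and the constant MIDDLE are hypothetical, and the pattern is the one above with the redundant backslashes dropped.

```python
import re
from typing import Optional

# First alternative: path with folders; second alternative: a single segment.
MIDDLE = re.compile(r'^https://www\.rocheplus\.es/(.*/[^.]+|[^./]+).*$')


def extract_middle(url: str) -> Optional[str]:
    """Return the middle part of a rocheplus.es URL, or None if it does not match."""
    match = MIDDLE.search(url)
    return match.group(1) if match else None


print(extract_middle("https://www.rocheplus.es/informacion-cientifica-roche.html"))
print(extract_middle("https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html"))
print(extract_middle("https://example.com/other/page.html"))  # different host, so None
```

Returning None instead of calling .group(1) directly avoids an AttributeError when re.search finds nothing.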



Answer 2:


Your initial approach was quite good; it was just too complicated, and it failed because of a few details of how regexes work.

First, I recommend using a tool like https://regex101.com/, set to Python, so you can try out your regex interactively.

Then try to come up with the requirements for your problem:

  1. You want to match everything after https://www.rocheplus.es/, so your regex has to start with something like https://www\.rocheplus\.es/. Don't forget to escape the dots. If your regex should only match when a line starts with this, use ^https://www\.rocheplus\.es/ (as you already did).
  2. You want to end the match as soon as a . has occurred. I'm not too sure about this, but it seems that would be easier than matching for protected, right? So your regex has to end with \., which matches a literal dot.
  3. Now for the matching, I started by using https://www\.rocheplus\.es/(.*)\. in the regex viewer. I basically combined the two parts and inserted (.*) to capture everything in between. That didn't work, because it matches everything up to the last dot instead of the first. That happens because .* is greedy by default, meaning it matches as much as possible. You can change this behaviour by appending a ? after the asterisk, so your complete regex will look like this:

https://www\.rocheplus\.es/(.*?)\. (see https://regex101.com/r/JnVsPP/1)

This should hopefully give you a good start to refine your regex from there.
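The difference between greedy and lazy matching described above can be seen on s2 from the question. This is a small sketch; the variable names are illustrative, and the last URL is a made-up example added only to show a limitation.

```python
import re

s2 = "https://www.rocheplus.es/products/oncologia/erivedge/evidencias-cientificas.cugod.protected.html"

# Pattern from the question: the greedy (.*) swallows the rest of the string,
# and the optional groups after it are satisfied by matching nothing.
original = re.search(r'^https://www.rocheplus.es/(.*)(\.cugod)*(\.protected)*(\..)*', s2).group(1)

# Greedy capture followed by a literal dot: stops at the LAST dot.
greedy = re.search(r'https://www\.rocheplus\.es/(.*)\.', s2).group(1)

# Lazy capture followed by a literal dot: stops at the FIRST dot.
lazy = re.search(r'https://www\.rocheplus\.es/(.*?)\.', s2).group(1)

print(original)
print(greedy)
print(lazy)

# Limitation: a dot inside a folder name (hypothetical URL) ends the lazy match early.
dotted = re.search(r'https://www\.rocheplus\.es/(.*?)\.', "https://www.rocheplus.es/v1.2/page.html").group(1)
print(dotted)
```

The lazy version is enough for the URLs in the question, but because it stops at the first dot anywhere in the path, a folder name containing a dot would cut the result short.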



Source: https://stackoverflow.com/questions/66028013/match-regular-expression-containing-multiple-dots
