Regex for links in HTML text

旧巷少年郎 2020-12-16 04:42

I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the <a> tags). I hav

8 answers
  •  隐瞒了意图╮
    2020-12-16 05:29

    Shouldn't a link be a well-defined regex? This is a rather theoretical question,

    I second PEZ's answer:

    I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

    As far as I know, any HTML tag may contain any number of nested tags. For example:

    <a href="...">stackoverflow</a>
    <a href="..."><i>stackoverflow</i></a>
    <a href="..."><b><i>stackoverflow</i></b></a>
    ...
    

    Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

    BE
    BBEE
    BBBEEE
    ...
    BBBBBBBBBBEEEEEEEEEE
    ...
    

    where B marks the beginning of a tag and E marks the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite-state automata) simply cannot do that: in order to count, an automaton needs at least a stack. As PEZ's answer says, HTML is described by a context-free grammar; it is not a regular language.
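
    To make the counting argument concrete, here is a minimal Python sketch. It is not part of PEZ's answer or the original question. The names is_balanced, LinkExtractor and extract_links are made up for illustration, and http://example.com is a placeholder URL. The first function recognizes the B...E strings above with an explicit counter, which is exactly the memory a finite-state automaton lacks. The second part sidesteps the problem for the original task by letting the standard library's html.parser do the tag matching and only collecting href values from <a> start tags.

```python
from html.parser import HTMLParser


def is_balanced(s):
    """Return True if s is n 'B's followed by exactly n 'E's (n >= 1)."""
    depth = 0          # the "counter" a plain finite automaton does not have
    seen_end = False
    for ch in s:
        if ch == "B":
            if seen_end:       # a B after an E breaks the BB..EE shape
                return False
            depth += 1
        elif ch == "E":
            seen_end = True
            depth -= 1
            if depth < 0:      # more E's than B's so far
                return False
        else:
            return False       # only B and E are allowed
    return depth == 0 and seen_end


class LinkExtractor(HTMLParser):
    """Collect href values of <a> tags, however deeply their content nests."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name lowercased and attrs as (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Hypothetical helper: feed a whole document and return the collected links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(is_balanced("BBBEEE"))
    print(extract_links('<a href="http://example.com"><b><i>stackoverflow</i></b></a>'))
```

    If all you need is the href values, a parser like this is likely to be more robust than a hand-written pattern, since it does not care what is nested inside the link.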
