Regex for links in html text

旧巷少年郎 2020-12-16 04:42

I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the tags). I hav

8 answers
  • 2020-12-16 05:44

    No, there isn't.

    Consider using Beautiful Soup instead. It is widely regarded as the standard tool for parsing HTML files in Python.
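
    As a rough illustration of the Beautiful Soup approach, here is a minimal sketch. It assumes the third-party `bs4` package is installed and uses Python's built-in `html.parser` backend. The helper name `extract_hrefs`, its `tag_name` parameter, and the sample markup are made up for this example and do not come from the question.

```python
from bs4 import BeautifulSoup


def extract_hrefs(html, tag_name="a"):
    """Return the href value of every <tag_name> element that has one.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only elements that actually carry an href attribute.
    return [element["href"] for element in soup.find_all(tag_name, href=True)]


# Made-up sample page: one stylesheet link, one real anchor,
# one anchor inside a comment, and one anchor without an href.
sample = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<a href="https://example.com/one">one</a>
<!-- <a href="https://example.com/hidden">commented out</a> -->
<a name="anchor-without-href">no href</a>
</body></html>
"""

print(extract_hrefs(sample))          # ['https://example.com/one']
print(extract_hrefs(sample, "link"))  # ['style.css']
```

    The commented-out anchor is skipped because the parser treats it as a comment node rather than as an element.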

  • 2020-12-16 05:44

    Shouldn't a link be a well-defined regex?

    No, [X]HTML is not in the general case parseable with regex. Consider examples like:

    <link title='hello">world' href="x">link</link>
    <!-- <link href="x">not a link</link> -->
    <![CDATA[ ><link href="x">not a link</link> ]]>
    <script>document.write('<link href="x">not a link</link>')</script>
    

    and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.

    If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.
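
    To make the failure modes concrete, here is a small sketch that runs a naive pattern and Python's standard-library `html.parser` over three of the cases above. The pattern `NAIVE_LINK_RE`, the class `LinkCollector`, and the `href` values are invented for the comparison. The CDATA case is left out because its handling depends on whether the page is parsed as XML or as HTML.

```python
import re
from html.parser import HTMLParser

# Three of the cases above, with distinct href values so the
# results show which tag each approach picked up.
DOC = """<link title='hello">world' href="real.html">link</link>
<!-- <link href="commented.html">not a link</link> -->
<script>document.write('<link href="script.html">not a link</link>')</script>
"""

# A naive pattern of the kind someone might write first (hypothetical).
NAIVE_LINK_RE = re.compile(r'<link[^>]*href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collect href values from real <link> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_via_regex(html):
    return NAIVE_LINK_RE.findall(html)


def hrefs_via_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(hrefs_via_regex(DOC))   # ['commented.html', 'script.html']
print(hrefs_via_parser(DOC))  # ['real.html']
```

    The regex misses the only real link, because the `>` inside the quoted `title` value stops `[^>]*`, and it reports the two tags that are not links at all. The parser gets all three cases right.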
