Extracting contents from specific meta tags that are not closed using BeautifulSoup

孤街浪徒 2020-12-28 09:34

I'm trying to parse out content from specific meta tags. Here's the structure of the meta tags. The first two are closed with a trailing slash, but the rest don't have any clo

6 answers
  •  轻奢々 (original poster)
    2020-12-28 10:05

    Edited: Added a case-insensitive regex (re.I), as suggested by @Albert Chen.

    Python 3 Edit:

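    from bs4 import BeautifulSoup
    import re
    import urllib.request
    
    # Fetch the raw HTML of the page.
    page3 = urllib.request.urlopen("https://angel.co/uber").read()
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup3 = BeautifulSoup(page3, "html.parser")
    
    # Match name="description" case-insensitively (e.g. "Description").
    desc = soup3.find_all(attrs={"name": re.compile(r"description", re.I)})
    print(desc[0]['content'])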
    

    Original Python 2 version, although I'm not sure it will work for every page:

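    from bs4 import BeautifulSoup
    import re
    import urllib
    
    # Python 2 only: urllib.urlopen moved to urllib.request in Python 3.
    page3 = urllib.urlopen("https://angel.co/uber").read()
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup3 = BeautifulSoup(page3, "html.parser")
    
    # Match name="description" case-insensitively (e.g. "Description").
    desc = soup3.find_all(attrs={"name": re.compile(r"description", re.I)})
    # Encode to UTF-8 bytes so Python 2 can print non-ASCII characters.
    print(desc[0]['content'].encode('utf-8'))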
    

    Yields:

    Learn about Uber's product, founders, investors and team. Everyone's Private
    Driver - Request a car from any mobile phone—text message, iPhone and Android
    apps. Within minutes, a professional driver in a sleek black car will arrive
    curbside. Automatically charged to your credit card on file, tip included.
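
Since I can't promise the snippet above works on every page, here is a minimal offline sketch of the same idea that you can run without hitting the network. The sample markup, the `SAMPLE_HTML` name and the `meta_content` helper are all made up for illustration and are not from the original page. It assumes the built-in "html.parser" backend, and it anchors the regex so that only an exact (case-insensitive) name matches, which is a bit stricter than the unanchored pattern used above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup: the first two meta tags are self-closed,
# the last two are left unclosed, as described in the question.
SAMPLE_HTML = """
<html><head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width" />
<meta name="Description" content="Example description text">
<meta property="og:title" content="Example title">
</head><body></body></html>
"""


def meta_content(html, name):
    """Return the content of <meta name=...>, or None if it is absent.

    Illustrative helper (not part of BeautifulSoup). The name is
    compared case-insensitively and must match in full.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Anchor the pattern so "description" does not also match
    # names that merely contain that word.
    pattern = re.compile(r"^%s$" % re.escape(name), re.I)
    tag = soup.find("meta", attrs={"name": pattern})
    if tag is None:
        return None
    # .get() returns None instead of raising if "content" is missing.
    return tag.get("content")


print(meta_content(SAMPLE_HTML, "description"))
print(meta_content(SAMPLE_HTML, "keywords"))
```

With this sample, the first call prints the description text even though the tag is unclosed and its name is capitalised, and the second prints None instead of raising an IndexError the way `desc[0]` would on a page without a matching tag.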
    
