beautifulsoup

Beautifulsoup - How to get all links inside a block with a certain class?

我的梦境 submitted on 2019-12-22 07:51:56
Question: I have the following HTML DOM: <div class="meta-info meta-info-wide"> <div class="title">Разработчик</div> <div class="content contains-text-link"> <a class="dev-link" href="http://www.jourist.com&sa=D&usg=AFQjCNHiC-nLYHAJwNnvDyYhyoeB6n8YKg" rel="nofollow" target="_blank">Перейти на веб-сайт</a> <a class="dev-link" href="mailto:info@jourist.com" rel="nofollow" target="_blank">Написать: info@jourist.com</a> <div class="content physical-address">Diagonalstraße 41 20537 Hamburg</div> </div> <
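A minimal sketch of one common approach: find the enclosing block first, then search only inside it. The HTML below is a trimmed, well-formed version of the fragment in the question, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

# Trimmed version of the fragment quoted in the question.
html = """
<div class="meta-info meta-info-wide">
  <div class="title">Developer</div>
  <div class="content contains-text-link">
    <a class="dev-link" href="http://www.jourist.com" rel="nofollow">Visit website</a>
    <a class="dev-link" href="mailto:info@jourist.com" rel="nofollow">Email: info@jourist.com</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Scope the search to the block with the wanted class, then collect its links.
block = soup.find("div", class_="meta-info")
links = [a["href"] for a in block.find_all("a", class_="dev-link")]
print(links)  # → ['http://www.jourist.com', 'mailto:info@jourist.com']
```

`class_="meta-info"` matches because BeautifulSoup compares against each class in a multi-valued `class` attribute individually.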

Use BeautifulSoup to Iterate over XML to pull specific tags and store in variable

走远了吗. submitted on 2019-12-22 07:48:08
Question: I'm fairly new to programming and have been trying to find a solution for this, but all I can find are bits and pieces, with no real luck putting it all together. I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text value between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually. So I'm trying to increase efficiency a bit with a scraping program. Let's say for example that I was
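The question does not include the actual XML, so the record layout below (student records with `name` and `score` tags) is invented purely for illustration; the pattern of iterating matching elements and pulling `.get_text()` carries over to any tag names.

```python
from bs4 import BeautifulSoup

# Hypothetical XML standing in for the training-program data described
# in the question; the tag names are assumptions.
xml = """
<records>
  <student><name>Alice</name><score>91</score></student>
  <student><name>Bob</name><score>84</score></student>
</records>
"""

# "html.parser" copes with simple XML; for strict XML parsing, install
# lxml and pass features="xml" instead.
soup = BeautifulSoup(xml, "html.parser")

# Iterate every <student> record and store the text between tags.
names = [s.find("name").get_text() for s in soup.find_all("student")]
scores = [int(s.find("score").get_text()) for s in soup.find_all("student")]
print(names, scores)  # → ['Alice', 'Bob'] [91, 84]
```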

beautifulsoup and invalid html document

只愿长相守 submitted on 2019-12-22 06:03:49
Question: I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get the countries and names at the beginning of the document. Here is my code: import urllib import re from bs4 import BeautifulSoup url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm" soup=BeautifulSoup(urllib.urlopen(url)) attendances_table=soup.find("table", {"width":850}) print attendances_table #this works, I see the whole table
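A sketch of extracting country/name pairs once the table has been found. The inline table below is an invented stand-in for the fetched page (the real document is retrieved over HTTP, and its cell contents are placeholders here); on badly broken real-world markup, a more forgiving third-party parser such as lxml or html5lib often recovers tags that the default parser drops.

```python
from bs4 import BeautifulSoup

# Invented, simplified stand-in for the attendance table on the real page;
# the names are placeholders, not the document's actual content.
html = """
<table width="850">
  <tr><td>Belgium:</td><td>Mr X</td></tr>
  <tr><td>Denmark:</td><td>Ms Y</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"width": "850"})
# Walk the rows and strip whitespace from each cell's text.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)  # → [['Belgium:', 'Mr X'], ['Denmark:', 'Ms Y']]
```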

How to get the option text using BeautifulSoup

北战南征 submitted on 2019-12-22 05:24:40
Question: I want to use BeautifulSoup to get the option text in the following HTML. For example, I'd like to get 2002/12, 2003/12, etc. <select id="start_dateid"> <option value="0">2002/12</option> <option value="1">2003/12</option> <option value="2">2004/12</option> <option value="3">2005/12</option> <option value="4">2006/12</option> <option value="5" selected="">2007/12</option> <option value="6">2008/12</option> <option value="7">2009/12</option> <option value="8">2010/12</option> <option value=
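A minimal sketch, using a shortened copy of the `<select>` from the question: locate the element by its `id`, then read each option's text with `.get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <select> quoted in the question.
html = """
<select id="start_dateid">
  <option value="0">2002/12</option>
  <option value="1">2003/12</option>
  <option value="5" selected="">2007/12</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", id="start_dateid")
texts = [opt.get_text() for opt in select.find_all("option")]
print(texts)  # → ['2002/12', '2003/12', '2007/12']
```

The option `value` attributes ("0", "1", ...) are available as `opt["value"]` if the numeric keys are needed as well.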

python BeautifulSoup searching a tag

一个人想着一个人 submitted on 2019-12-22 05:24:34
Question: My first post here. I'm trying to find all tags in this specific HTML and I can't get them out. This is the code: from bs4 import BeautifulSoup from urllib import urlopen url = "http://www.jutarnji.hr" html_doc = urlopen(url).read() soup = BeautifulSoup(html_doc) soup.prettify() soup.find_all("a", {"class":"black"}) The find function returns [], but I see that there are tags with class "black" in the HTML. Am I missing something? Thanks, Vedran Answer 1: I also had same problem. Try soup.findAll("a",{
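For reference, the class search itself is correct in BeautifulSoup 4; the sketch below shows it working on a small inline stand-in for the page. When the same call returns `[]` on a messy real-world site, the usual culprit is the parser: a more forgiving one (lxml or html5lib, both third-party installs) may recover the tags that the default parser silently drops. Note also that `soup.prettify()` only returns a string; it does not change the soup.

```python
from bs4 import BeautifulSoup

# Tiny inline stand-in for the real page.
html = '<p><a class="black" href="/a">one</a><a class="red" href="/b">two</a></p>'

soup = BeautifulSoup(html, "html.parser")
# class_="black" is equivalent to the {"class": "black"} dict form.
black_links = soup.find_all("a", class_="black")
print([a.get_text() for a in black_links])  # → ['one']
```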

Cannot import Beautiful Soup

僤鯓⒐⒋嵵緔 submitted on 2019-12-22 03:33:08
Question: I am trying to use BeautifulSoup, and despite using the import statement from bs4 import BeautifulSoup I am getting the error: ImportError: cannot import name BeautifulSoup. import bs4 does not give any errors. I have also tried import bs4.BeautifulSoup, and just importing bs4 and creating a BeautifulSoup object with bs4.BeautifulSoup(). Any guidance would be appreciated. Answer 1: The issue was that I had named the file HTMLParser.py, and that name is already used somewhere in the bs4 module. Thanks to
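When `import bs4` works but `from bs4 import BeautifulSoup` fails, a file in the working directory shadowing a module bs4 depends on (as with `HTMLParser.py` in the answer) is a common cause. A quick diagnostic sketch is to print where the modules actually resolve from:

```python
# Check which files are actually being imported; a path pointing into your
# own project directory instead of site-packages indicates shadowing.
import bs4
print(bs4.__file__)

from bs4 import BeautifulSoup
print(BeautifulSoup.__name__)  # → BeautifulSoup
```

Renaming the offending script (and deleting its stale `.pyc`, if any) resolves the import error.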

Why is Python insisting on using ascii?

老子叫甜甜 submitted on 2019-12-22 01:32:19
Question: When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages: if 'var' in str(tag.string): Here is the context: response = requests.get(url) soup = bs4.BeautifulSoup(response.text.encode('utf-8')) for tag in soup.findAll('script'): if 'var' in str(tag.string): # This is the line throwing the exception print(tag.string) Here is the exception: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in
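On Python 2, `str()` on a unicode string containing non-ASCII bytes triggers exactly this `ascii` codec error; the `str()` call is also unnecessary. A sketch of the fix, using an inline page standing in for the fetched response (the `ü` forces non-ASCII content like the `0xc3` byte in the traceback):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page; one script has non-ASCII content,
# one is src-only and has no inline content at all.
html = """
<html><head>
<script>var grüße = 1;</script>
<script src="app.js"></script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")

found = []
for tag in soup.find_all("script"):
    # tag.string is None for empty or src-only scripts, and wrapping it in
    # str() is what triggers the ascii codec on Python 2; test it directly.
    if tag.string and "var" in tag.string:
        found.append(tag.string)
print(found)
```

Also note that `response.text` is already decoded text; re-encoding it with `.encode('utf-8')` before handing it to BeautifulSoup is redundant, and passing `response.text` directly is simpler.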

Please help parse this html table using BeautifulSoup and lxml the pythonic way

懵懂的女人 submitted on 2019-12-22 01:32:09
Question: I have searched a lot about BeautifulSoup, and some suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table from a whole list of tables on the webpage. I am interested in the three columns, with a varied number of rows depending on the page and the time it was checked. A BeautifulSoup and lxml solution would be well appreciated. That way I can ask the admin to install lxml on the dev. machine. Desired output: Website Last Visited Last
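The question's table markup isn't shown, so the three-column layout below (Website / Last Visited / Last Loaded, extending the truncated "Desired output" header) is an assumed stand-in with placeholder data. The header/rows split is the part that transfers to the real page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the question's table; cell values are placeholders.
html = """
<table>
  <tr><th>Website</th><th>Last Visited</th><th>Last Loaded</th></tr>
  <tr><td>example.com</td><td>2010-01-01</td><td>2010-01-02</td></tr>
  <tr><td>example.org</td><td>2010-02-01</td><td>2010-02-02</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
# Skip the header row; the remaining rows vary in number per page.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]
print(headers)
print(rows)
```

With lxml installed, passing `"lxml"` instead of `"html.parser"` to the `BeautifulSoup` constructor is the only change needed.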

Python: Another 'NoneType' object has no attribute error

北慕城南 submitted on 2019-12-22 01:23:57
Question: For a newbie exercise, I am trying to find the meta tag in an HTML file and extract the generator, so I did this: Version = soup.find("meta", {"name":"generator"})['content'] Since I had this error: TypeError: 'NoneType' object has no attribute '__getitem__' I thought that handling the exception would correct it, so I wrote: try: Version = soup.find("meta", {"name":"generator"})['content'] except NameError,TypeError: print "Not found" and what I got is the same error. What
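`find()` returns `None` when nothing matches, so subscripting the result raises `TypeError`; a sketch of guarding before indexing instead of catching the exception (the inline page deliberately has no generator tag):

```python
from bs4 import BeautifulSoup

# Inline page with no generator meta tag, to exercise the None case.
html = "<html><head><title>t</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None on no match; check before subscripting.
tag = soup.find("meta", {"name": "generator"})
version = tag["content"] if tag is not None else "Not found"
print(version)  # → Not found
```

Separately, the Python 2 clause `except NameError,TypeError:` does not catch both exceptions; it catches `NameError` and binds it to the name `TypeError`. Catching both requires a tuple: `except (NameError, TypeError):`.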

BeautifulSoup and SoupStrainer for getting links don't work with hasattr, always returning True

♀尐吖头ヾ submitted on 2019-12-22 01:16:59
Question: I am using BeautifulSoup4 and SoupStrainer with Python 3.3 to get all links from a webpage. The following is the important code snippet: r = requests.get(adress, headers=headers) for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')): if hasattr(link, 'href'): I tested some webpages and it works very well, but today, when using adress = 'http://www.goldentigercasino.de/' I noticed that hasattr(link, 'href') always returns True even when there is no 'href' attribute, as in
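`hasattr(link, 'href')` is always true because attribute access on a `Tag` falls back to a child-tag lookup (`link.href` quietly returns `link.find("href")`, which is `None`), so no `AttributeError` is ever raised. A sketch of the fix on an inline stand-in page, using `href=True` to let `find_all` do the attribute filtering (`link.has_attr("href")` works too):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Inline stand-in for the fetched page; one <a> deliberately has no href.
html = '<a href="/x">with</a><a name="anchor">without</a>'

# SoupStrainer restricts parsing to <a> tags, as in the question's snippet.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# href=True keeps only tags that actually carry an href attribute, avoiding
# the misleading hasattr() check entirely.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # → ['/x']
```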