beautifulsoup

Beautifulsoup - How to get all links inside a block with a certain class?

我的梦境 submitted on 2019-12-22 07:51:56
Question: I have the following HTML DOM: <div class="meta-info meta-info-wide"> <div class="title">Разработчик</div> <div class="content contains-text-link"> <a class="dev-link" href="http://www.jourist.com&sa=D&usg=AFQjCNHiC-nLYHAJwNnvDyYhyoeB6n8YKg" rel="nofollow" target="_blank">Перейти на веб-сайт</a> <a class="dev-link" href="mailto:info@jourist.com" rel="nofollow" target="_blank">Написать: info@jourist.com</a> <div class="content physical-address">Diagonalstraße 41 20537 Hamburg</div> </div> <
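A minimal sketch of one common approach: find the enclosing block first, then search only inside it. The HTML below is a trimmed, well-formed version of the fragment in the question, and `html.parser` is assumed.

```python
from bs4 import BeautifulSoup

# Trimmed version of the fragment quoted in the question.
html = """
<div class="meta-info meta-info-wide">
  <div class="title">Developer</div>
  <div class="content contains-text-link">
    <a class="dev-link" href="http://www.jourist.com" rel="nofollow">Visit website</a>
    <a class="dev-link" href="mailto:info@jourist.com" rel="nofollow">Email: info@jourist.com</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Scope the search to the block with the wanted class, then collect its links.
block = soup.find("div", class_="meta-info")
links = [a["href"] for a in block.find_all("a", class_="dev-link")]
print(links)  # → ['http://www.jourist.com', 'mailto:info@jourist.com']
```

`class_="meta-info"` matches because BeautifulSoup compares against each class in a multi-valued `class` attribute individually.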

Use BeautifulSoup to Iterate over XML to pull specific tags and store in variable

走远了吗. submitted on 2019-12-22 07:48:08
Question: I'm fairly new to programming and have been trying to find a solution for this, but all I can find are bits and pieces, with no real luck putting it all together. I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text value between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually. So I'm trying to increase efficiency a bit with a scraping program. Let's say for example that I was
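The question does not include the actual XML, so the record layout below (student records with `name` and `score` tags) is invented purely for illustration; the pattern of iterating matching elements and pulling `.get_text()` carries over to any tag names.

```python
from bs4 import BeautifulSoup

# Hypothetical XML standing in for the training-program data described
# in the question; the tag names are assumptions.
xml = """
<records>
  <student><name>Alice</name><score>91</score></student>
  <student><name>Bob</name><score>84</score></student>
</records>
"""

# "html.parser" copes with simple XML; for strict XML parsing, install
# lxml and pass features="xml" instead.
soup = BeautifulSoup(xml, "html.parser")

# Iterate every <student> record and store the text between tags.
names = [s.find("name").get_text() for s in soup.find_all("student")]
scores = [int(s.find("score").get_text()) for s in soup.find_all("student")]
print(names, scores)  # → ['Alice', 'Bob'] [91, 84]
```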

beautifulsoup and invalid html document

只愿长相守 submitted on 2019-12-22 06:03:49
Question: I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm. I want to get the countries and names at the beginning of the document. Here is my code: import urllib import re from bs4 import BeautifulSoup url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/acf8e.htm" soup=BeautifulSoup(urllib.urlopen(url)) attendances_table=soup.find("table", {"width":850}) print attendances_table #this works, I see the whole table
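A sketch of extracting country/name pairs once the table has been found. The inline table below is an invented stand-in for the fetched page (the real document is retrieved over HTTP, and its cell contents are placeholders here); on badly broken real-world markup, a more forgiving third-party parser such as lxml or html5lib often recovers tags that the default parser drops.

```python
from bs4 import BeautifulSoup

# Invented, simplified stand-in for the attendance table on the real page;
# the names are placeholders, not the document's actual content.
html = """
<table width="850">
  <tr><td>Belgium:</td><td>Mr X</td></tr>
  <tr><td>Denmark:</td><td>Ms Y</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"width": "850"})
# Walk the rows and strip whitespace from each cell's text.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
print(rows)  # → [['Belgium:', 'Mr X'], ['Denmark:', 'Ms Y']]
```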

How to get the option text using BeautifulSoup

北战南征 submitted on 2019-12-22 05:24:40
Question: I want to use BeautifulSoup to get the option text in the following HTML. For example, I'd like to get 2002/12, 2003/12, etc. <select id="start_dateid"> <option value="0">2002/12</option> <option value="1">2003/12</option> <option value="2">2004/12</option> <option value="3">2005/12</option> <option value="4">2006/12</option> <option value="5" selected="">2007/12</option> <option value="6">2008/12</option> <option value="7">2009/12</option> <option value="8">2010/12</option> <option value=
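A minimal sketch, using a shortened copy of the `<select>` from the question: locate the element by its `id`, then read each option's text with `.get_text()`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <select> quoted in the question.
html = """
<select id="start_dateid">
  <option value="0">2002/12</option>
  <option value="1">2003/12</option>
  <option value="5" selected="">2007/12</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
select = soup.find("select", id="start_dateid")
texts = [opt.get_text() for opt in select.find_all("option")]
print(texts)  # → ['2002/12', '2003/12', '2007/12']
```

The option `value` attributes ("0", "1", ...) are available as `opt["value"]` if the numeric keys are needed as well.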

python BeautifulSoup searching a tag

一个人想着一个人 submitted on 2019-12-22 05:24:34
Question: My first post here. I'm trying to find all tags in this specific HTML and I can't get them out. This is the code: from bs4 import BeautifulSoup from urllib import urlopen url = "http://www.jutarnji.hr" html_doc = urlopen(url).read() soup = BeautifulSoup(html_doc) soup.prettify() soup.find_all("a", {"class":"black"}) The find function returns [], but I see that there are tags with class "black" in the HTML. Am I missing something? Thanks, Vedran Answer 1: I also had same problem. Try soup.findAll("a",{
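For reference, the class search itself is correct in BeautifulSoup 4; the sketch below shows it working on a small inline stand-in for the page. When the same call returns `[]` on a messy real-world site, the usual culprit is the parser: a more forgiving one (lxml or html5lib, both third-party installs) may recover the tags that the default parser silently drops. Note also that `soup.prettify()` only returns a string; it does not change the soup.

```python
from bs4 import BeautifulSoup

# Tiny inline stand-in for the real page.
html = '<p><a class="black" href="/a">one</a><a class="red" href="/b">two</a></p>'

soup = BeautifulSoup(html, "html.parser")
# class_="black" is equivalent to the {"class": "black"} dict form.
black_links = soup.find_all("a", class_="black")
print([a.get_text() for a in black_links])  # → ['one']
```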

Cannot import Beautiful Soup

僤鯓⒐⒋嵵緔 submitted on 2019-12-22 03:33:08
Question: I am trying to use BeautifulSoup, and despite using the import statement from bs4 import BeautifulSoup I am getting the error: ImportError: cannot import name BeautifulSoup. import bs4 does not give any errors. I have also tried import bs4.BeautifulSoup, and just importing bs4 and creating a BeautifulSoup object with bs4.BeautifulSoup(). Any guidance would be appreciated. Answer 1: The issue was that I had named the file HTMLParser.py, and that name is already used somewhere in the bs4 module. Thanks to
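When `import bs4` works but `from bs4 import BeautifulSoup` fails, a file in the working directory shadowing a module bs4 depends on (as with `HTMLParser.py` in the answer) is a common cause. A quick diagnostic sketch is to print where the modules actually resolve from:

```python
# Check which files are actually being imported; a path pointing into your
# own project directory instead of site-packages indicates shadowing.
import bs4
print(bs4.__file__)

from bs4 import BeautifulSoup
print(BeautifulSoup.__name__)  # → BeautifulSoup
```

Renaming the offending script (and deleting its stale `.pyc`, if any) resolves the import error.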

Why is Python insisting on using ascii?

老子叫甜甜 submitted on 2019-12-22 01:32:19
Question: When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages: if 'var' in str(tag.string): Here is the context: response = requests.get(url) soup = bs4.BeautifulSoup(response.text.encode('utf-8')) for tag in soup.findAll('script'): if 'var' in str(tag.string): # This is the line throwing the exception print(tag.string) Here is the exception: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in
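On Python 2, `str()` on a unicode string containing non-ASCII bytes triggers exactly this `ascii` codec error; the `str()` call is also unnecessary. A sketch of the fix, using an inline page standing in for the fetched response (the `ü` forces non-ASCII content like the `0xc3` byte in the traceback):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page; one script has non-ASCII content,
# one is src-only and has no inline content at all.
html = """
<html><head>
<script>var grüße = 1;</script>
<script src="app.js"></script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")

found = []
for tag in soup.find_all("script"):
    # tag.string is None for empty or src-only scripts, and wrapping it in
    # str() is what triggers the ascii codec on Python 2; test it directly.
    if tag.string and "var" in tag.string:
        found.append(tag.string)
print(found)
```

Also note that `response.text` is already decoded text; re-encoding it with `.encode('utf-8')` before handing it to BeautifulSoup is redundant, and passing `response.text` directly is simpler.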

Please help parse this html table using BeautifulSoup and lxml the pythonic way

懵懂的女人 submitted on 2019-12-22 01:32:09
Question: I have searched a lot about BeautifulSoup, and some suggested lxml as the future of BeautifulSoup. While that makes sense, I am having a tough time parsing the following table from a whole list of tables on the webpage. I am interested in the three columns, with a varied number of rows depending on the page and the time it was checked. A BeautifulSoup and lxml solution would be well appreciated. That way I can ask the admin to install lxml on the dev. machine. Desired output: Website Last Visited Last
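The question's table markup isn't shown, so the three-column layout below (Website / Last Visited / Last Loaded, extending the truncated "Desired output" header) is an assumed stand-in with placeholder data. The header/rows split is the part that transfers to the real page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the question's table; cell values are placeholders.
html = """
<table>
  <tr><th>Website</th><th>Last Visited</th><th>Last Loaded</th></tr>
  <tr><td>example.com</td><td>2010-01-01</td><td>2010-01-02</td></tr>
  <tr><td>example.org</td><td>2010-02-01</td><td>2010-02-02</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]
# Skip the header row; the remaining rows vary in number per page.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]
print(headers)
print(rows)
```

With lxml installed, passing `"lxml"` instead of `"html.parser"` to the `BeautifulSoup` constructor is the only change needed.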

Python: Another 'NoneType' object has no attribute error

北慕城南 submitted on 2019-12-22 01:23:57
Question: For a newbie exercise, I am trying to find the meta tag in an HTML file and extract the generator, so I did this: Version = soup.find("meta", {"name":"generator"})['content'] Since I had this error: TypeError: 'NoneType' object has no attribute '__getitem__' I thought that handling the exception would correct it, so I wrote: try: Version = soup.find("meta", {"name":"generator"})['content'] except NameError,TypeError: print "Not found" and what I got is the same error. What
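`find()` returns `None` when nothing matches, so subscripting the result raises `TypeError`; a sketch of guarding before indexing instead of catching the exception (the inline page deliberately has no generator tag):

```python
from bs4 import BeautifulSoup

# Inline page with no generator meta tag, to exercise the None case.
html = "<html><head><title>t</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns None on no match; check before subscripting.
tag = soup.find("meta", {"name": "generator"})
version = tag["content"] if tag is not None else "Not found"
print(version)  # → Not found
```

Separately, the Python 2 clause `except NameError,TypeError:` does not catch both exceptions; it catches `NameError` and binds it to the name `TypeError`. Catching both requires a tuple: `except (NameError, TypeError):`.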

BeautifulSoup and SoupStrainer for getting links don't work with hasattr, always returning True

♀尐吖头ヾ submitted on 2019-12-22 01:16:59
Question: I am using BeautifulSoup4 and SoupStrainer with Python 3.3 to get all links from a webpage. The following is the important code snippet: r = requests.get(adress, headers=headers) for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')): if hasattr(link, 'href'): I tested some webpages and it works very well, but today, when using adress = 'http://www.goldentigercasino.de/' I noticed that hasattr(link, 'href') always returns True even when there is no 'href' attribute, as in
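`hasattr(link, 'href')` is always true because attribute access on a `Tag` falls back to a child-tag lookup (`link.href` quietly returns `link.find("href")`, which is `None`), so no `AttributeError` is ever raised. A sketch of the fix on an inline stand-in page, using `href=True` to let `find_all` do the attribute filtering (`link.has_attr("href")` works too):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Inline stand-in for the fetched page; one <a> deliberately has no href.
html = '<a href="/x">with</a><a name="anchor">without</a>'

# SoupStrainer restricts parsing to <a> tags, as in the question's snippet.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# href=True keeps only tags that actually carry an href attribute, avoiding
# the misleading hasattr() check entirely.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # → ['/x']
```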