beautifulsoup

malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

假装没事ソ posted on 2020-01-12 07:26:06
Question: I just installed Python, mplayer, BeautifulSoup and Sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seemed straightforward, but I am running into problems. I'm not that familiar with Python, so this may be out of my league. I was able to get everything installed, but running sipie gives this:

    /usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
      import md5
    Traceback (most recent call last):
      File "/usr/bin/Sipie
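The DeprecationWarning in the output above is separate from the traceback: it only says that the old md5 module has been superseded by hashlib. A minimal sketch of the hashlib equivalent, assuming the code just needs a hex MD5 digest of a string (`md5_hex` is a hypothetical helper, not part of Sipie):

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of a text string (hypothetical helper)."""
    # hashlib works on bytes, so encode the text first.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


print(md5_hex("hello"))
```

This only addresses the warning; it does not explain the malformed start tag error from the title.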

Depth First Traversal on BeautifulSoup Parse Tree

五迷三道 posted on 2020-01-12 04:49:27
Question: Is there a way to do a depth-first traversal (DFT) on a BeautifulSoup parse tree? I'm trying to start at the root, usually <html>, get all the child elements, then for each child element get its children, and so on until I hit a terminal node, at which point I'll build my way back up the tree. The problem is that I can't seem to find a method that will let me do this. I found the findChildren method, but that seems to just put the entire page in a list multiple times, with each subsequent entry getting
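One possible way to get the traversal described above is a small recursive generator that visits children before their parent (post-order). This is only a sketch: it relies on each node exposing a `.children` iterable, which bs4 `Tag` objects do, and it uses a made-up `Node` class as a stand-in so the example runs without BeautifulSoup installed.

```python
class Node:
    """Tiny stand-in for a parse-tree element (hypothetical, demo only)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def post_order(node):
    """Yield nodes depth-first, children before their parent."""
    # Leaves (for example text nodes) may have no `children` attribute.
    for child in getattr(node, "children", ()):
        yield from post_order(child)
    yield node


# html -> head -> title, and html -> body -> p, a
tree = Node("html", [
    Node("head", [Node("title")]),
    Node("body", [Node("p"), Node("a")]),
])

visited = [n.name for n in post_order(tree)]
print(visited)
```

With BeautifulSoup, passing the soup object to `post_order` should work the same way, though that is untested here. If a top-down order is enough, bs4 also offers the `.descendants` iterator, which walks the tree depth-first in document order.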

Extract CSS from href links

≡放荡痞女 posted on 2020-01-11 13:21:32
Question: This is the code to extract all the href links of a website, given the website's URL.

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re

    html_page = urllib2.urlopen("http://kteq.in/services")
    soup = BeautifulSoup(html_page)
    for link in soup.findAll('a'):
        if link.get('href') is None:
            continue
        result = re.sub(r"http\S+", "", link.get('href'))
        print result

When I run the above code, the href links of that website are extracted. I get the following output.

    index
    index
    #
    solutions
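The snippet in the question targets Python 2 (urllib2, BeautifulSoup 3). As a rough Python 3 sketch of the same loop, the version below swaps in the standard library's `html.parser` for BeautifulSoup and parses a made-up inline HTML string instead of fetching the page, so it runs without network access. `HrefCollector` and `sample_html` are illustrative names, not from the original.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:  # skip anchors without an href, like the original loop
            self.hrefs.append(href)


# Made-up sample markup standing in for the fetched page.
sample_html = """
<a href="index">Home</a>
<a href="#">Top</a>
<a name="no-href">Anchor</a>
<a href="http://example.com/solutions">Solutions</a>
"""

collector = HrefCollector()
collector.feed(sample_html)

# Same substitution as in the question: blank out anything starting with "http".
results = [re.sub(r"http\S+", "", href) for href in collector.hrefs]
print(results)
```

Note that the substitution turns absolute links into empty strings, so only relative hrefs survive in `results`.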

<urlopen error [Errno 1] _ssl.c:510: error:14077417:SSL

徘徊边缘 posted on 2020-01-11 11:47:31
Question: Does anyone know why I am getting this error?

    SSLError: [Errno 1] _ssl.c:510: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1

I get the error when using requests or urllib2. I'm running the code on Kodi; the same code runs fine in Visual Studio on my PC. I am trying to scrape a website that is blocked by my ISP, so I'm using a proxy version of the site.

    import requests

    url = 'https://kickass.unblocked.pe/'
    r = requests.get(url)

Answer 1: The site is hosted by Cloudflare Free SSL

How can I grab the element by matching text in its attribute in BeautifulSoup

微笑、不失礼 posted on 2020-01-11 11:33:09
Question: I have this code:

    <a title="Next Page - Results 1 to 60 " href="bla bla" class="smallfont" rel="next">></a>

I want to grab the a element and get its href. How can I partially match the text in the title attribute of the a element against "Next Page"? There are many similar a tags on the page; the only difference is that the title attribute contains "Next Page" or the text is >.

Answer 1: You would have to use Regex for accomplishing what you want. First take the
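The regex idea from the answer can be sketched as follows. With BeautifulSoup itself, the usual form is to pass a compiled pattern as the attribute filter, for example `soup.find('a', title=re.compile('Next Page'))`. The runnable sketch below applies the same partial match using only the standard library's `html.parser`, so it works without bs4 installed; `TitleMatcher` and the extra "Previous Page" link are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TitleMatcher(HTMLParser):
    """Collect href values of <a> tags whose title matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        title = attributes.get("title") or ""  # tolerate a missing title
        if self.pattern.search(title):  # partial match anywhere in the title
            self.matches.append(attributes.get("href"))


sample_html = (
    '<a title="Previous Page" href="page1" class="smallfont">&lt;</a>'
    '<a title="Next Page - Results 1 to 60 " href="bla bla" '
    'class="smallfont" rel="next">&gt;</a>'
)

matcher = TitleMatcher(r"Next Page")
matcher.feed(sample_html)
print(matcher.matches)
```

`search` is used rather than `match` so the pattern can occur anywhere in the title, not only at the start.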

Extracting HTML content from a search page using Beautiful Soup with Python

不问归期 posted on 2020-01-11 11:25:27
Question: I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info for all the accommodations in Spain. This is the search URL: https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0

Extracting HTML content from a search page using Beautiful Soup with Python

有些话、适合烂在心里 posted on 2020-01-11 11:25:08
Question: I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info for all the accommodations in Spain. This is the search URL: https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0

Add parent tags with beautiful soup

半城伤御伤魂 posted on 2020-01-11 10:38:36
Question: I have many pages of HTML with various sections containing these code snippets:

    <div class="footnote" id="footnote-1">
      <h3>Reference:</h3>
      <table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
        <tr>
          <td valign="top" width="20px">
            <a href="javascript:void(0);" onclick='javascript:toggleFootnote("footnote-1");' title="click to hide this reference">1.</a>
          </td>
          <td>
            <p> blah </p>
          </td>
        </tr>
      </table>
    </div>

I can parse the HTML successfully and extract

Python BeautifulSoup not scraping this url

大憨熊 posted on 2020-01-11 07:26:50
Question: I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am positive my code is fine, because it works with other statistics websites that contain tables. Can anyone tell me why nothing is happening? Thanks in advance.

    import urllib
    import urllib.request
    from bs4 import BeautifulSoup

    def make_soup(url):
        thepage = urllib.request.urlopen(url)
        soupdata = BeautifulSoup(thepage, "html.parser")
        return soupdata

    soup = make_soup("https:/

BeautifulSoup: do not add spaces where they matter, remove them where they don't

吃可爱长大的小学妹 posted on 2020-01-11 06:43:08
Question: This sample Python program:

    document='''<p>This is <i>something</i>, it happens in <b>real</b> life</p>'''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(document)
    print(soup.prettify())

produces the following output:

    <html>
     <body>
      <p>
       This is
       <i>
        something
       </i>
       , it happens in
       <b>
        real
       </b>
       life
      </p>
     </body>
    </html>

That's wrong, because it adds whitespace before and after each opening and closing tag; for example, there should be no space between </i> and the comma. I would like it to: Not