beautifulsoup | 易学教程

malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

Question: I just installed Python, mplayer, BeautifulSoup and Sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seem straightforward, but I am encountering some issues. I'm not that familiar with Python, so this may be out of my league. I was able to get everything installed, but running sipie then gives this: /usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead import md5 Traceback (most recent call last): File "/usr/bin/Sipie

Depth First Traversal on BeautifulSoup Parse Tree

Question: Is there a way to do a DFT on a BeautifulSoup parse tree? I'm trying to do something like starting at the root, usually <html>, get all the child elements, and then for each child element get its children, and so on until I hit a terminal node, at which point I'll build my way back up the tree. The problem is that I can't seem to find a method that will allow me to do this. I found the findChildren method, but that seems to just put the entire page in a list multiple times with each subsequent entry getting
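A depth-first walk of the kind described in the question can be sketched with a small recursive generator over each tag's `.children`. This is only an illustration, not the answer from the original thread: the helper name `depth_first` and the sample HTML are made up here, and the sketch assumes the `bs4` package with the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def depth_first(node, depth=0):
    """Yield (depth, tag) pairs in pre-order (parent before children)."""
    yield depth, node
    for child in node.children:
        # Skip text nodes and comments; only descend into real tags.
        if isinstance(child, Tag):
            yield from depth_first(child, depth + 1)


# Hypothetical sample document, used only for illustration.
html = "<html><body><div><p>one</p><p>two</p></div><span>three</span></body></html>"
soup = BeautifulSoup(html, "html.parser")

visited = [(depth, tag.name) for depth, tag in depth_first(soup.html)]
for depth, name in visited:
    print("  " * depth + name)
```

To process nodes on the way back up the tree instead, move the `yield depth, node` line after the loop, which turns the walk into a post-order traversal.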

Extract CSS from href links

Question: This is the code to extract all the href links of a website, given the website's URL. from BeautifulSoup import BeautifulSoup import urllib2 import re html_page = urllib2.urlopen("http://kteq.in/services") soup = BeautifulSoup(html_page) for link in soup.findAll('a'): if link.get('href')==None: continue result = re.sub(r"http\S+", "", link.get('href')) print result When I run the above code, the href links of that website are extracted. I get the following output. index index # solutions
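The code in the question targets Python 2 and the old `BeautifulSoup` 3 import. A rough Python 3 equivalent of the link-collection step, assuming the `bs4` package, might look like the sketch below. The inline `sample_html` is a made-up stand-in for the downloaded page so the example runs without network access, and filtering out absolute links is a simplification of the question's `re.sub` call, which blanks them instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page (assumption, not the real site).
sample_html = (
    '<a href="index">Home</a>'
    "<a>No link</a>"
    '<a href="#">Top</a>'
    '<a href="http://example.com/x">External</a>'
)

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute,
# replacing the explicit "== None" check in the question.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Keep only links that are not absolute http(s) URLs.
relative = [h for h in hrefs if not h.startswith("http")]

print(hrefs)
print(relative)
```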

<urlopen error [Errno 1] _ssl.c:510: error:14077417:SSL

Question: Does anyone know why I am getting this error? SSLError: [Errno 1] _ssl.c:510: error:14077438:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 I get the error when using requests or urllib2. I'm running the code on Kodi. The code runs fine when I run it in Visual Studio on my PC. I am trying to scrape a website that is blocked by my ISP, so I'm using a proxy version of the site. import requests url = 'https://kickass.unblocked.pe/' r = requests.get(url) Answer 1: The site is hosted by Cloudflare Free SSL

How can I grab the element by matching text in its attribute in BeautifulSoup

Question: I have this code: <a title="Next Page - Results 1 to 60 " href="bla bla" class="smallfont" rel="next">></a> I want to grab the a element and get its href. How can I match the title attribute against "Next Page"? I want to partially match the text in the title attribute of the a element. There are many a tags on the page similar to it; the only difference is that the title attribute contains "Next Page" or the text is >. Answer 1: You would have to use a regex to accomplish what you want. First take the
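The answer is cut off, but the regex approach it points to can be sketched as follows, assuming the `bs4` package. Passing a compiled pattern as an attribute filter makes `find` match any tag whose attribute value contains the pattern. The sample markup and the `page2.html` and `page0.html` hrefs are placeholders invented for this example.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet in the question.
html = (
    '<a title="Next Page - Results 1 to 60 " href="page2.html" '
    'class="smallfont" rel="next">&gt;</a>'
    '<a title="Prev Page" href="page0.html" class="smallfont">&lt;</a>'
)

soup = BeautifulSoup(html, "html.parser")

# A compiled regex as the attribute value performs a partial (search) match.
link = soup.find("a", title=re.compile(r"Next Page"))
next_href = link["href"] if link is not None else None

print(next_href)
```

Checking for `None` matters on the last page of results, where no "Next Page" link exists and `find` returns nothing.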

Extracting HTML content from a search page using Beautiful Soup with Python

Question: I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info from all the accommodations in Spain. This is the search URL: https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0

Add parent tags with beautiful soup

Question: I have many pages of HTML with various sections containing these code snippets: <div class="footnote" id="footnote-1"> <h3>Reference:</h3> <table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%"> <tr> <td valign="top" width="20px"> <a href="javascript:void(0);" onclick='javascript:toggleFootnote("footnote-1");' title="click to hide this reference">1.</a> </td> <td> <p> blah </p> </td> </tr> </table> </div> I can parse the HTML successfully and extract
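The excerpt stops before the asker states the exact goal, so the following is only a guess at what "adding a parent tag" could look like with `bs4`: each footnote `div` is wrapped in a newly created element via `Tag.wrap`. The `section` wrapper and its `footnote-wrapper` class are hypothetical names, and the sample markup is a shortened version of the snippet in the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the footnote markup from the question.
html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3><p>blah</p></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    # new_tag creates a detached element; wrap() inserts it as the new parent.
    wrapper = soup.new_tag("section")
    wrapper["class"] = "footnote-wrapper"  # assumed class name for illustration
    footnote.wrap(wrapper)

print(soup)
```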

Python BeautifulSoup not scraping this url

Question: I am trying to scrape some rows of player data (tr) from a URL, but nothing appears to happen when I run my code. I am positive my code is fine because it works with other statistical websites containing tables. Can anyone tell me why nothing is happening? Thanks in advance. import urllib import urllib.request from bs4 import BeautifulSoup def make_soup(url): thepage = urllib.request.urlopen(url) soupdata = BeautifulSoup(thepage, "html.parser") return soupdata soup = make_soup("https:/

BeautifulSoup: do not add spaces where they matter, remove them where they don't

Question: This sample Python program: document='''<p>This is <i>something</i>, it happens in <b>real</b> life</p>''' from bs4 import BeautifulSoup soup = BeautifulSoup(document) print(soup.prettify()) produces the following output: <html> <body> <p> This is <i> something </i> , it happens in <b> real </b> life </p> </body> </html> That's wrong, because it adds whitespace before and after each opening and closing tag and, for example, there should be no space between </i> and the comma. I would like it to: Not
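The list of requirements is cut off, so this sketch only illustrates the contrast the question describes and is not a full solution. Serializing with `str(soup)` writes the tree back without inserting whitespace, whereas `prettify()` reformats it. It assumes the `bs4` package and names the `html.parser` backend explicitly, which, unlike the parser behind the question's output, does not add `<html>` and `<body>` wrappers.

```python
from bs4 import BeautifulSoup

document = "<p>This is <i>something</i>, it happens in <b>real</b> life</p>"
soup = BeautifulSoup(document, "html.parser")

# str() serializes the tree as parsed, with no extra whitespace around tags.
compact = str(soup)

# prettify() reformats the markup across multiple lines for readability.
pretty = soup.prettify()

print(compact)
print(pretty)
```

Because whitespace between inline elements is significant in rendered HTML, `prettify()` is best treated as a debugging aid rather than as output to be saved or served.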