beautifulsoup

Parsing html using BeautifulSoup in Python

Submitted by 落花浮王杯 on 2019-12-22 10:47:11
Question: I wrote some code to parse HTML, but the result was not what I wanted: import urllib2 html = urllib2.urlopen('http://dummy').read() from BeautifulSoup import BeautifulSoup soup = BeautifulSoup(html) for definition in soup.findAll('span', {"class":'d'}): definition = definition.renderContents() print "<meaning>", definition for exampleofuse in soup.find('span',{"class":'x'}): print "<exampleofuse>", exampleofuse, "<exampleofuse>" print "<meaning>" Is there any kind of way that when class
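A hedged sketch of what the asker seems to want, pairing each definition with the usage example that follows it. The HTML snippet below is an assumption, since the question's markup is not shown. The key fix: `soup.find` always returns the first match in the whole document, so each definition should instead walk forward to its own neighbouring example:

```python
from bs4 import BeautifulSoup

# Assumed dictionary-style markup: each <span class="d"> meaning is
# followed by a <span class="x"> example of use.
html = """
<span class="d">a greeting</span>
<span class="x">Hello, world!</span>
<span class="d">to call out</span>
<span class="x">He helloed across the room.</span>
"""

soup = BeautifulSoup(html, "html.parser")
pairs = []
for definition in soup.find_all("span", {"class": "d"}):
    # find_next_sibling searches forward from *this* definition;
    # soup.find would always return the first <span class="x"> on the page.
    example = definition.find_next_sibling("span", {"class": "x"})
    pairs.append((definition.get_text(), example.get_text() if example else None))

for meaning, example in pairs:
    print("<meaning>", meaning)
    print("<exampleofuse>", example)
```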

Beautiful Soup can't find the part of the HTML I want

Submitted by 不想你离开。 on 2019-12-22 10:30:35
Question: I've been using BeautifulSoup for web scraping for a while, and this is the first time I've encountered a problem like this. I am trying to select the number 101,172 in the code, but even though I use .find or .select, the output is always only the tag, not the number. I have worked with similar data collection before and hadn't had any problems. <div class="legend-block legend-block--pageviews"> <h5>Pageviews</h5><hr> <div class="legend-block--body"> <div class="linear-legend--counts"> Pageviews: <span
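What usually bites here is printing the matched tag object instead of its text. A minimal sketch against a reconstructed version of the snippet (the question's HTML is cut off, so the count span's exact attributes are an assumption):

```python
from bs4 import BeautifulSoup

# Hedged reconstruction of the truncated markup from the question;
# the trailing <span> is assumed to hold the pageview count.
html = """
<div class="legend-block legend-block--pageviews">
  <h5>Pageviews</h5><hr>
  <div class="legend-block--body">
    <div class="linear-legend--counts">
      Pageviews:
      <span>101,172</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.select_one(".linear-legend--counts span")
# .get_text() extracts the tag's string content; printing `span`
# itself shows the whole tag, which is what the asker is seeing.
count = span.get_text(strip=True)
print(count)
```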

BeautifulSoup does not work for some web sites

Submitted by 痴心易碎 on 2019-12-22 10:29:41
Question: I have this script: import urllib2 from bs4 import BeautifulSoup url = "http://www.shoptop.ru/" page = urllib2.urlopen(url).read() soup = BeautifulSoup(page) divs = soup.findAll('a') print divs For this web site, it prints an empty list. What can be the problem? I am running on Ubuntu 12.04 Answer 1: Actually, there are quite a few bugs in BeautifulSoup which might raise some unknown errors. I had a similar issue when working on Apache using the lxml parser. So, just try a couple of other parsers
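Following the answer's suggestion, one way to try several parsers in order is a small fallback loop (sketched offline here against an inline HTML string; `lxml` and `html5lib` are optional installs, so the stdlib `html.parser` is the last resort):

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = '<a href="/one">one</a><a href="/two">two</a>'

# Try the stricter/faster parsers first; bs4 raises FeatureNotFound
# when the requested parser library is not installed.
links = []
for parser in ("lxml", "html5lib", "html.parser"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        continue
    links = soup.find_all("a")
    if links:
        break

print([a["href"] for a in links])
```

Different parsers repair malformed markup differently, so a page that yields an empty tree under one parser can parse fine under another.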

Encoding issue of a character in utf-8

Submitted by 喜你入骨 on 2019-12-22 10:28:07
Question: I get a link from a web page by using the Beautiful Soup library through a.get('href'). In the link there is a strange character ®, but when I retrieve it, it becomes ®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the beginning of the file. r = requests.get(url) soup = BeautifulSoup(r.text) Answer 1: Do not use r.text; leave decoding to BeautifulSoup: soup = BeautifulSoup(r.content) r.content gives you the response in bytes, without decoding. r.text, on the other hand, is the

Download all pdf files from a website using Python

Submitted by 99封情书 on 2019-12-22 09:56:59
Question: I have followed several online guides in an attempt to build a script that can identify and download all PDFs from a website, to save me from doing it manually. Here is my code so far: from urllib import request from bs4 import BeautifulSoup import re import os import urllib # connect to website and get list of all pdfs url="http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html" response = request.urlopen(url).read() soup= BeautifulSoup(response, "html.parser") links = soup.find_all('a',
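The usual pattern for this task is to filter the anchors by extension and resolve relative hrefs against the page URL. A sketch, run here against an inline snippet so it stays offline (the anchor texts and paths are made up; the download step is shown but commented out):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
# Stand-in for request.urlopen(base).read():
html = """
<a href="slides/lect1.pdf">Lecture 1</a>
<a href="notes.html">Notes</a>
<a href="slides/lect2.pdf">Lecture 2</a>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only anchors whose href ends in .pdf, and resolve each
# relative path against the page URL with urljoin.
pdf_urls = [
    urljoin(base, a["href"])
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".pdf")
]
print(pdf_urls)

# To actually fetch them:
# from urllib import request
# for url in pdf_urls:
#     request.urlretrieve(url, url.rsplit("/", 1)[-1])
```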

Python Requests/BeautifulSoup access to pagination

Submitted by 陌路散爱 on 2019-12-22 09:47:38
Question: I am trying to access different pages of a website to get a list of items (20 per page). There is one extra parameter to send to select the page, but somehow I am not able to pass it along properly; the parameter has to be sent in the body of the request. I tried with params and with data without any success. What is the proper method to add something to the "body" of a request? Here is what I have. It gives me the first page six times. import requests from bs4 import BeautifulSoup import time
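In requests, `data=` is what form-encodes values into the POST body, while `params=` appends them to the URL. You can see the difference offline by preparing a request and inspecting it (the endpoint and field names here are hypothetical; the real ones have to be read off the site's own form or XHR in the browser's network tab):

```python
import requests

# Hypothetical endpoint and field names for illustration only.
url = "http://dummy/search"
req = requests.Request("POST", url, data={"page": 3, "perPage": 20})
prepared = req.prepare()

# data= form-encodes the values into the request *body*; params=
# would have appended them to the URL query string instead, which a
# server expecting a POST body typically ignores.
print(prepared.body)
print(prepared.url)
```

If the body is built correctly but every page still returns the same items, the next thing to check is whether the site expects a session cookie or a JSON body (`json=` instead of `data=`).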

retrieving just the title of a webpage in python

Submitted by 倖福魔咒の on 2019-12-22 09:29:37
Question: I have more than 5000 webpages and I want the titles of all of them. In my project I am using the BeautifulSoup HTML parser like this: soup = BeautifulSoup(open(url).read()) soup('title')[0].string But it's taking a lot of time. Just for the title of a webpage I am reading the entire file and building the parse tree (I thought this is the reason for the delay; correct me if I am wrong). Is there any other simple way to do this in Python? Answer 1: It would certainly be faster if you just used a simple
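One way to avoid building the full parse tree is a `SoupStrainer`, which tells BeautifulSoup to keep only the tags you name. A sketch against a padded in-memory document (the padding stands in for the asker's large pages):

```python
from bs4 import BeautifulSoup, SoupStrainer

# A page with a lot of body content we do not care about.
html = (
    "<html><head><title>My Page</title></head><body>"
    + "<p>filler</p>" * 1000
    + "</body></html>"
)

# parse_only discards everything except <title> elements while
# parsing, so the tree stays tiny regardless of page size.
only_title = SoupStrainer("title")
soup = BeautifulSoup(html, "html.parser", parse_only=only_title)
title = soup.title.string if soup.title else None
print(title)
```

Since `<title>` lives in `<head>`, reading only the first few kilobytes of each file before parsing cuts the cost further.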

BeautifulSoup 3.1 parser breaks far too easily

Submitted by 蹲街弑〆低调 on 2019-12-22 09:27:39
Question: I was having trouble parsing some dodgy HTML with BeautifulSoup. It turns out that the HTMLParser used in newer versions is less tolerant than the SGMLParser used previously. Does BeautifulSoup have some kind of debug mode? I'm trying to figure out how to stop it borking on some nasty HTML I'm loading from a crabby website: <HTML> <HEAD> <TITLE>Title</TITLE> <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> </HEAD> <BODY> ... ... </BODY> </HTML> BeautifulSoup gives up after the <HTTP-EQUIV...> tag In [1]
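For what it's worth, modern bs4 (with a pluggable parser) copes with this markup; the offending line looks like a `<META HTTP-EQUIV=...>` tag missing its tag name, which current parsers treat as an unknown element and carry on past. A sketch (the `<p>` content is added so there is something to find after the bad tag):

```python
from bs4 import BeautifulSoup

# The question's markup, with body content filled in for the demo.
nasty = """
<HTML><HEAD><TITLE>Title</TITLE>
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
</HEAD><BODY><p>content survives</p></BODY></HTML>
"""

# html.parser tolerates the malformed tag and keeps parsing; content
# both before and after it remains reachable.
soup = BeautifulSoup(nasty, "html.parser")
print(soup.title.string)
print(soup.find("p").get_text())
```

For even dirtier pages, `html5lib` follows the browser error-recovery rules and is usually the most forgiving choice.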

Having problems understanding BeautifulSoup filtering

Submitted by 走远了吗. on 2019-12-22 08:52:12
Question: Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class=g's to grabbing just the items of interest in that specific div, but I just get None returns or no prints. Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has
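A minimal sketch of the scoped-filtering pattern the question describes: find the `srg` container first, then search only inside it (the inner `<h3>` elements are assumptions, since the question's full markup is cut off):

```python
from bs4 import BeautifulSoup

# Assumed structure: an srg container holding several g divs, plus a
# stray g div outside it that we want to exclude.
html = """
<div class="srg">
  <div class="g"><h3>First result</h3></div>
  <div class="g"><h3>Second result</h3></div>
</div>
<div class="g"><h3>Outside</h3></div>
"""

soup = BeautifulSoup(html, "html.parser")
# class_="g" matches any element whose class *list* contains "g";
# calling find_all on the container, not on soup, scopes the search.
srg = soup.find("div", class_="srg")
titles = [g.h3.get_text() for g in srg.find_all("div", class_="g")]
print(titles)
```

A `None` return usually means the outer `find` failed (wrong class name, or the content is injected by JavaScript and absent from the fetched HTML), and every attribute access on that `None` then breaks.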

Parsing html data into python list for manipulation

Submitted by 荒凉一梦 on 2019-12-22 08:34:59
Question: I am trying to read in HTML websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access: EPS (Basic)\n13.4620.6226.6930.1732.81\n\n So I would like to create a list called
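The run-together digits in that line can be split on the assumption that each value has exactly two decimal places, so `re.findall` with a fixed-width fraction recovers the five numbers:

```python
import re

# The scraped text block from the question: five yearly EPS values
# concatenated with no separators, each with two decimal places.
block = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

# \d+\.\d{2} stops each match after two fractional digits, so the
# next match starts right where the following value begins.
eps = [float(m) for m in re.findall(r"\d+\.\d{2}", block)]
print(eps)
```

Note this only works while the site prints two decimals; if the precision varies, it would be more robust to pull the values from the table cells with BeautifulSoup before they get flattened into one string.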