beautifulsoup | 易学教程

Unable to save image from web using urllib2

Question: I want to save some images from a website using Python's urllib2, but when I run the code it saves something else. This is my code:

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    url = "http://m.jaaar.com/"
    r = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(r).read()
    soup = BeautifulSoup(page)
    imgTags = soup.findAll('img')
    imgTags = imgTags[1:]
    for imgTag in imgTags:
        imgUrl = "http://www.jaaar.com" + imgTag['src']
        imgUrl =
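The excerpt is cut off before the download step, so the actual cause is not visible here. What follows is a minimal sketch of two things that often matter in this situation, under stated assumptions: resolving each `src` against the page URL with `urljoin` instead of string concatenation, and writing the response bytes in binary mode. It is written for Python 3 (`urllib.request` rather than `urllib2`), and the helper names `absolute_url`, `save_bytes`, and `download_image` are hypothetical.

```python
import os
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Same User-Agent string as in the question.
HEADERS = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}


def absolute_url(page_url, src):
    """Resolve an <img> src (relative or absolute) against the page URL."""
    return urljoin(page_url, src)


def save_bytes(data, path):
    """Write raw bytes to disk; binary mode keeps image data intact."""
    with open(path, "wb") as fh:
        fh.write(data)


def download_image(page_url, src, directory="."):
    """Fetch one image and save it under its own file name (hypothetical helper)."""
    img_url = absolute_url(page_url, src)
    request = Request(img_url, headers=HEADERS)
    with urlopen(request) as response:
        data = response.read()
    path = os.path.join(directory, os.path.basename(img_url))
    save_bytes(data, path)
    return path
```

If a saved file turns out to be HTML rather than an image, comparing the URL produced by `absolute_url` with the one the browser requests is a reasonable first check.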

Compressing “n”-time object member call

Question: Is there a way to call a member n times on an object without an explicit for loop? I was thinking about some map/reduce/lambda approach, but I couldn't figure out a way to do this, if it's possible at all. For context, I'm using BeautifulSoup to extract some elements from an HTML table: the first few, and then the last one. Since I have:

    # First value
    print value.text
    # Second value
    value = value.nextSibling
    print value.text
    # Ninth value
    for i in xrange(1, 7):
        value = value
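One way to avoid the explicit loop, sketched under assumptions: `functools.reduce` can apply the same attribute lookup n times. The `Cell` class below is a hypothetical stand-in for a BeautifulSoup element so the example runs without any HTML. Only the `nextSibling` attribute name comes from the question, and `nth_attr` is an illustrative helper.

```python
from functools import reduce


class Cell:
    """Hypothetical stand-in for a parsed element with a nextSibling link."""

    def __init__(self, text, next_sibling=None):
        self.text = text
        self.nextSibling = next_sibling


def nth_attr(obj, name, n):
    """Follow the attribute `name` n times, starting from obj."""
    return reduce(lambda current, _: getattr(current, name), range(n), obj)


# Build a chain of nine cells: "v1" -> "v2" -> ... -> "v9".
head = None
for label in reversed(["v%d" % i for i in range(1, 10)]):
    head = Cell(label, head)

second = nth_attr(head, "nextSibling", 1)
ninth = nth_attr(head, "nextSibling", 8)
print(second.text, ninth.text)
```

With real BeautifulSoup elements, whitespace between tags can show up as sibling text nodes, so the number of hops may differ from the number of table cells.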

Scraping a webpage that has JavaScript with BeautifulSoup

Question: I'm fine with scraping simple websites with tags, but recently I ran into a more complex website that relies on JavaScript. I would like to obtain all the estimates at the bottom of the page as a table (CSV), with columns like 'User', 'Revenue estimate', 'EPS estimate'. I hoped to figure it out by myself but failed. Here is my code:

    from urllib import urlopen
    from bs4 import BeautifulSoup
    html = urlopen("https://www.estimize.com/jpm/fq3-2016

Extracting title from HTML not working

Question: I'm performing some text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading as HTML and later converting to text. My problem is extracting the metadata from the HTML files, in particular the title of each novel. As of now, I'm using BeautifulSoup to generate the text files and extract the title. For an example text of Jane Eyre, my code is as follows:

    from bs4 import BeautifulSoup
    ### Opens html file
    html = open(

How to modify an html tree in python?

Question: Suppose there is a variable fragment holding some HTML code: string 1 string 2 string 3 Some text. I need to modify the contents of all the tags with the class code by passing their content through some function, such as foo, which returns the modified contents of the tag. Ultimately, I should get a new piece of HTML like this:

BeautifulSoup - combine consecutive tags

Question: I have to work with the messiest HTML, where individual words are split into separate tags, as in the following example: INTRODUCTION That's kind of

'NoneType' object is not callable beautifulsoup error while using get_text

Question: I wrote this code for extracting all text from a web page:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    soup = BeautifulSoup(urllib2.urlopen('http://www.pythonforbeginners.com').read())
    print(soup.get_text())

The problem is I get this error:

    print(soup.get_text())
    TypeError: 'NoneType' object is not callable

Any idea how to solve this?

Answer 1: The method is called soup.getText(), i.e. camelCased. Why you get a TypeError instead of an AttributeError here is a mystery to me!

Answer 2: As
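A possible explanation for the TypeError, offered as a sketch rather than a statement about the library's internals: as far as I know, BeautifulSoup 3 treats an unknown attribute on a tag as a search for a child tag of that name, and returns None when there is none. `soup.get_text` would then be None, and calling it raises the TypeError. The `FakeTag` class below is a hypothetical stand-in that imitates only this lookup behaviour. It is not BeautifulSoup code.

```python
class FakeTag:
    """Hypothetical stand-in imitating attribute-as-child-tag lookup."""

    def __init__(self, children=None):
        # Maps child tag names to child objects.
        self.children = children or {}

    def find(self, name):
        """Return the child named `name`, or None if there is none."""
        return self.children.get(name)

    def getText(self):
        return "all the text"

    def __getattr__(self, name):
        # Only called when normal lookup fails: fall back to a child search.
        return self.find(name)


soup = FakeTag({"title": "a title tag"})

print(soup.getText())   # a real method, so the call succeeds
print(soup.get_text)    # no such method and no such child tag

try:
    soup.get_text()     # calling the result of the failed child search
except TypeError as exc:
    message = str(exc)
    print(message)
```

In bs4 the method is spelled `get_text()`, so the code from the question should work after switching to `from bs4 import BeautifulSoup`.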

What does “module object is not callable” mean?

Question: I'm using the .get_data() method with mechanize, which appears to print out the HTML that I want. I also checked the type of what it prints out, and the type is 'str'. But when I try to parse the str with BeautifulSoup, I get the following error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-163-11c061bf6c04> in <module>()
          7 html = get_html(first[i],last[i])
          8 print type(html)
    ----> 9 print parse_page(html)
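The traceback is cut off, so the failing line is not visible. In general this message means that a name being called refers to a module rather than to a function or class inside it. Here is a minimal sketch with the standard library's `datetime`, chosen only because the module and its main class share a name. Whether the same mix-up happened with `BeautifulSoup` in the question is an assumption.

```python
import datetime  # binds the name to the module, not the class

try:
    datetime(2016, 1, 1)  # calling the module itself
except TypeError as exc:
    message = str(exc)
    print(message)

# Reaching the class through the module works.
day = datetime.datetime(2016, 1, 1)
print(day.year)
```

With the old BeautifulSoup 3 package, `import BeautifulSoup` followed by `BeautifulSoup(html)` fails the same way, while `from BeautifulSoup import BeautifulSoup` binds the class.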

How to find element based on text ignore child tags in beautifulsoup

Question: I am looking for a solution using Python and BeautifulSoup to find an element based on its inner text. For example:

    <div> Ignore this textFind based on this text </div>

How can I find this div? Thanks for your help!

Answer 1: You can use .find with the text argument and then use findParent to get the parent element. For example:

    from bs4 import BeautifulSoup
    s="""<div> Ignore this textFind based on this text </div>"""
    soup = BeautifulSoup(s, 'html.parser')
    t = soup.find(text="Find based on this
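A runnable sketch of the approach in the first answer, under assumptions: the markup in the excerpt appears to have lost its inner tags, so the `<span>` below is a guess at the original structure. It uses `string=`, the newer spelling of the `text=` argument in bs4, and `find_parent`, the snake_case name for `findParent`.

```python
from bs4 import BeautifulSoup

# Assumed markup: the ignored text sits in a child tag of the div.
s = "<div><span>Ignore this text</span>Find based on this text</div>"
soup = BeautifulSoup(s, "html.parser")

# Match the text node itself, then climb to the enclosing div.
node = soup.find(string="Find based on this text")
div = node.find_parent("div")

print(div.name)
print(div.span.get_text())
```

An exact `string=` match only succeeds when the text node equals the search string character for character. If the real page has surrounding whitespace, a compiled regular expression can be passed instead.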

Maximum recursion depth exceeded. Multiprocessing and bs4

Question: I'm trying to write a parser using BeautifulSoup and multiprocessing. I get an error:

    RecursionError: maximum recursion depth exceeded

My code is:

    import bs4, requests, time
    from multiprocessing.pool import Pool
    html = requests.get('https://www.avito.ru/moskva/avtomobili/bmw/x6?sgtd=5&radius=0')
    soup = bs4.BeautifulSoup(html.text, "html.parser")
    divList = soup.find_all("div", {'class': 'item_table-header'})
    def new_check():
        with Pool() as pool:
            pool.map(get_info, divList)
    def get_info(each):
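A commonly suggested workaround, sketched under assumptions: `pool.map` pickles its arguments to send them to worker processes, and bs4 `Tag` objects link to their parents, siblings, and children, which can push pickling past the recursion limit. Passing plain strings and re-parsing in the worker avoids sending tags at all. The body of `get_info` is not shown in the excerpt, so the version below is a hypothetical placeholder that returns the fragment's text, and `to_fragments` is an illustrative helper.

```python
from multiprocessing import Pool

from bs4 import BeautifulSoup


def get_info(fragment_html):
    """Worker: re-parse one HTML fragment (a str) and return its text."""
    fragment = BeautifulSoup(fragment_html, "html.parser")
    return fragment.get_text(strip=True)


def to_fragments(tags):
    """Serialize tags to plain strings before they cross process boundaries."""
    return [str(tag) for tag in tags]


def new_check(div_list):
    """Map get_info over the serialized tags in worker processes."""
    with Pool() as pool:
        return pool.map(get_info, to_fragments(div_list))
```

Calling `new_check(divList)` from under an `if __name__ == "__main__":` guard keeps worker processes from re-running module-level code on platforms that start workers by spawning.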