beautifulsoup

Unable to save image from web using urllib2

蓝咒 posted on 2019-12-23 19:59:16
Question: I want to save some images from a website using Python's urllib2, but when I run the code it saves something else. This is my code: user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' headers = { 'User-Agent' : user_agent } url = "http://m.jaaar.com/" r = urllib2.Request(url, headers=headers) page = urllib2.urlopen(r).read() soup = BeautifulSoup(page) imgTags = soup.findAll('img') imgTags = imgTags[1:] for imgTag in imgTags: imgUrl = "http://www.jaaar.com" + imgTag['src'] imgUrl =
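The excerpt is cut off before the download step, so the actual cause is not visible here. As a rough Python 3 sketch of two steps that often matter in this kind of task: resolve each `src` against the page URL with `urllib.parse.urljoin` instead of string concatenation, and write the downloaded bytes in binary mode. The helper names `resolve_image_urls` and `save_bytes` and the sample `src` values are hypothetical, not taken from the question.

```python
import os
import tempfile
from urllib.parse import urljoin


def resolve_image_urls(page_url, srcs):
    """Turn possibly relative src values into absolute URLs."""
    return [urljoin(page_url, src) for src in srcs]


def save_bytes(data, path):
    """Write raw bytes; binary mode keeps image data intact."""
    with open(path, "wb") as handle:
        handle.write(data)


# Illustrative src values, not taken from the real page.
urls = resolve_image_urls(
    "http://m.jaaar.com/",
    ["/img/a.jpg", "img/b.png", "http://www.jaaar.com/c.gif"],
)

# Stand-in for downloaded image bytes, written to a temporary location.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
save_bytes(b"\x89PNG\r\n", demo_path)
```

In a real script the bytes would come from reading the HTTP response for each resolved URL; checking the response's Content-Type before saving is one way to notice when the server returned an HTML page instead of an image.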

Compressing “n”-time object member call

心已入冬 posted on 2019-12-23 18:47:00
Question: Is there any way, other than an explicit for loop, to call a member n times on an object? I was thinking about some map/reduce/lambda approach, but I couldn't figure out a way to do this -- if it's possible. Just to add context, I'm using BeautifulSoup, and I'm extracting some elements from an HTML table; I extract the first few elements, and then the last one. Since I have: # First value print value.text # Second value value = value.nextSibling print value.text # Ninth value for i in xrange(1, 7): value = value
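One possible way to express "follow the same attribute n times" without writing the loop out is `functools.reduce`. The sketch below uses a small stand-in `Node` class rather than real BeautifulSoup objects, and the helper name `advance` is hypothetical.

```python
from functools import reduce


class Node:
    """Stand-in for an object with a nextSibling-style attribute."""

    def __init__(self, text, next_sibling=None):
        self.text = text
        self.nextSibling = next_sibling


def advance(node, steps, attr="nextSibling"):
    """Follow `attr` from `node` `steps` times and return the result."""
    return reduce(lambda current, _: getattr(current, attr), range(steps), node)


# Build a chain: cell 0 -> cell 1 -> ... -> cell 8
chain = None
for index in reversed(range(9)):
    chain = Node("cell %d" % index, chain)

# Eight hops from the first cell reach the ninth one.
ninth = advance(chain, 8)
```

`reduce` still iterates internally, so this only hides the loop; whether it reads better than a plain `for` is a matter of taste.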

Scraping a webpage that has JavaScript with BeautifulSoup

泪湿孤枕 posted on 2019-12-23 16:51:08
Question: Hi guys! I am turning to you once again. I am OK with scraping simple websites with tags, but recently I've encountered a quite complex website that relies on JavaScript. As a result, I would like to obtain all the estimates at the bottom of the page as a table (CSV), with columns like 'User', 'Revenue estimate', 'EPS estimate'. I hoped to figure it out by myself but failed. Here is my code: from urllib import urlopen from bs4 import BeautifulSoup html = urlopen("https://www.estimize.com/jpm/fq3-2016

Extracting title from HTML not working

放肆的年华 posted on 2019-12-23 16:48:00
Question: I'm performing some text analytics on a large number of novels downloaded from Gutenberg. I want to keep as much metadata as I can, so I'm downloading the novels as HTML and later converting them to text. My problem is extracting the metadata from the HTML files, in particular the title of each novel. As of now, I'm using BeautifulSoup to generate the text files and extract the title. For an example text of Jane Eyre, my code is as follows: from bs4 import BeautifulSoup ### Opens html file html = open(
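The question's code is cut off, so the reason the extraction fails is not visible. As a minimal sketch, assuming the title lives in the document's `<title>` element, it can be read with `soup.title` while guarding against a missing tag. The helper name `extract_title` and the sample markup are made up for illustration and are not the real Gutenberg file.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the <title> text with whitespace collapsed, or None."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return " ".join(soup.title.string.split())


# Made-up sample markup for illustration only.
sample = (
    "<html><head><title>\n Jane Eyre,\n by Charlotte Bronte </title></head>"
    "<body><p>text</p></body></html>"
)
title = extract_title(sample)
```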

How to modify an html tree in python?

瘦欲@ posted on 2019-12-23 16:19:25
Question: Suppose there is a fragment of HTML code: <p> <span class="code"> string 1 </span> <span class="code"> string 2 </span> <span class="code"> string 3 </span> </p> <p> <span class="any"> Some text </span> </p> I need to modify the contents of all the <span> tags with the class code by passing the content through some function, such as foo, which returns the modified contents of the <span> tag. Ultimately, I should get a new piece of HTML like this: <p> <span class="code">
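A sketch of how this could look with bs4: select the spans with `find_all("span", class_="code")` and assign the transformed text back through `.string`. The body of `foo` here (upper-casing) is only a placeholder, since the question does not say what the real function does.

```python
from bs4 import BeautifulSoup

html = (
    '<p><span class="code">string 1</span>'
    '<span class="code">string 2</span></p>'
    '<p><span class="any">Some text</span></p>'
)


def foo(text):
    """Placeholder transform; upper-casing stands in for the real one."""
    return text.upper()


soup = BeautifulSoup(html, "html.parser")

# Only spans with class "code" are rewritten; other spans are untouched.
for span in soup.find_all("span", class_="code"):
    span.string = foo(span.get_text())

result = str(soup)
```

Assigning to `.string` replaces everything inside the tag, so this fits spans that hold plain text; spans with nested markup would need a different approach.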

BeautifulSoup - combine consecutive tags

此生再无相见时 posted on 2019-12-23 12:31:26
Question: I have to work with the messiest HTML, where individual words are split into separate tags, as in the following example: <b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b> That's kind of

'NoneType' object is not callable beautifulsoup error while using get_text

心不动则不痛 posted on 2019-12-23 12:24:34
Question: I wrote this code for extracting all text from a web page: from BeautifulSoup import BeautifulSoup import urllib2 soup = BeautifulSoup(urllib2.urlopen('http://www.pythonforbeginners.com').read()) print(soup.get_text()) The problem is I get this error: print(soup.get_text()) TypeError: 'NoneType' object is not callable Any idea about how to solve this? Answer 1: The method is called soup.getText(), i.e. camelCased. Why you get TypeError instead of AttributeError here is a mystery to me! Answer 2: As
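The TypeError has a plausible explanation: BeautifulSoup treats an unknown attribute name as a child-tag lookup, so the attribute access returns None instead of raising AttributeError, and calling None then fails. The sketch below reproduces that behaviour with bs4 (where `get_text()` does exist); `no_such_method` is a made-up name used only to trigger the fallback.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# An unknown attribute is looked up as a child tag name, so the result
# is None rather than an AttributeError.
missing = soup.no_such_method  # "no_such_method" is a made-up name

try:
    soup.no_such_method()  # calling None raises TypeError
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# In bs4 the snake_case method is available.
text = soup.get_text()
```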

What does “module object is not callable” mean?

本小妞迷上赌 posted on 2019-12-23 12:24:29
Question: I'm using the .get_data() method with mechanize, which appears to print out the HTML that I want. I also check the type of what it prints out, and the type is 'str'. But when I try to parse the str with BeautifulSoup, I get the following error: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-163-11c061bf6c04> in <module>() 7 html = get_html(first[i],last[i]) 8 print type(html) ----> 9 print parse_page(html)
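The traceback is truncated, so the line that actually failed is not shown. In general, this message means that a module itself was called as if it were a function or class. One common way to hit it with BeautifulSoup 3 is `import BeautifulSoup` followed by `BeautifulSoup(html)`; whether that is what happened here is an assumption. The sketch below shows the same mistake with the standard library's `datetime` module, chosen only because it needs no third-party install.

```python
import datetime

try:
    datetime(2019, 12, 23)  # calls the module itself: TypeError
    message = ""
except TypeError as exc:
    message = str(exc)

# Calling the class that lives inside the module works.
stamp = datetime.datetime(2019, 12, 23)
```

The usual fix is either to import the class directly (`from module import Name`) or to qualify the call (`module.Name(...)`).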

How to find element based on text ignore child tags in beautifulsoup

大城市里の小女人 posted on 2019-12-23 12:01:39
Question: I am looking for a solution using Python and BeautifulSoup to find an element based on its inner text. For example: <div> <b>Ignore this text</b>Find based on this text </div> How can I find this div? Thanks for your help! Answer 1: You can use .find with the text argument and then use findParent to get the parent element. Ex: from bs4 import BeautifulSoup s="""<div> <b>Ignore this text</b>Find based on this text </div>""" soup = BeautifulSoup(s, 'html.parser') t = soup.find(text="Find based on this
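Because the answer's code is cut off, here is a separate minimal sketch of the same idea rather than a reconstruction of it. It matches the text node with a regular expression (so surrounding whitespace does not matter) and then steps up to the enclosing tag via `.parent`. It uses the `string=` argument, which is the newer name for `text=` in bs4.

```python
import re

from bs4 import BeautifulSoup

html = "<div> <b>Ignore this text</b>Find based on this text </div>"
soup = BeautifulSoup(html, "html.parser")

# find(string=...) returns the matching text node, not a tag.
node = soup.find(string=re.compile(r"Find based on this text"))

# The text node's parent is the element that directly contains it.
div = node.parent
```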

Maximum recursion depth exceeded. Multiprocessing and bs4

时间秒杀一切 posted on 2019-12-23 11:06:09
Question: I'm trying to build a parser using BeautifulSoup and multiprocessing. I get an error: RecursionError: maximum recursion depth exceeded My code is: import bs4, requests, time from multiprocessing.pool import Pool html = requests.get('https://www.avito.ru/moskva/avtomobili/bmw/x6?sgtd=5&radius=0') soup = bs4.BeautifulSoup(html.text, "html.parser") divList = soup.find_all("div", {'class': 'item_table-header'}) def new_check(): with Pool() as pool: pool.map(get_info, divList) def get_info(each):
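The excerpt ends before `get_info` is shown, so the diagnosis here is an assumption. `pool.map` has to pickle each argument to send it to a worker process, and bs4 Tag objects are linked to their parents and siblings, so pickling them can recurse very deeply. A possible workaround, sketched below, is to pass plain strings and re-parse them inside the worker. The sample markup and the helpers `tags_to_strings` and `run_in_pool` are hypothetical.

```python
from multiprocessing import Pool

import bs4


def get_info(fragment):
    """Worker: re-parse the HTML string and pull out its text."""
    soup = bs4.BeautifulSoup(fragment, "html.parser")
    return soup.get_text(strip=True)


def tags_to_strings(tags):
    """Serialize Tag objects to plain strings, which pickle cheaply."""
    return [str(tag) for tag in tags]


def run_in_pool(fragments, processes=2):
    """Map the worker over the string fragments in a process pool."""
    with Pool(processes) as pool:
        return pool.map(get_info, fragments)


# Made-up markup standing in for the downloaded page.
html = (
    '<div class="item_table-header">BMW X6 2015</div>'
    '<div class="item_table-header">BMW X6 2018</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
div_list = soup.find_all("div", {"class": "item_table-header"})
fragments = tags_to_strings(div_list)
```

In a script, `run_in_pool(fragments)` would be called under an `if __name__ == "__main__":` guard so that worker processes do not re-run the module's top-level code.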