beautifulsoup

Download files using requests and BeautifulSoup

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-20 17:14:54
Question: I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:

```python
import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
```
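The loop above opens output files without ever downloading anything. A minimal sketch of the intended approach, run here against an inline HTML stand-in for the real directory listing (the filenames and base URL are illustrative assumptions):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical directory listing standing in for the real page.
html = '''
<html><body>
<a href="apostila-01.pdf">Apostila 1</a>
<a href="apostila-02.pdf">Apostila 2</a>
<a href="../">Parent directory</a>
</body></html>
'''

base_url = 'http://www.desconversa.com.br/matematica/wp-content/uploads/2013/01/'
soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors that point at PDFs, and resolve them against the page URL.
pdf_urls = [urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.pdf')]

print(pdf_urls)
```

On the real page you would then fetch each URL with `requests.get(url)` and write `response.content` to a file opened in `'wb'` mode, one file per link rather than one per loop index.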

Scraping multiple paginated links with BeautifulSoup and Requests

Submitted by 偶尔善良 on 2019-12-20 15:22:04
Question: Python beginner here. I'm trying to scrape all products from one category on dabs.com. I've managed to scrape all products on a given page, but I'm having trouble iterating over all the paginated links. Right now I've tried to isolate all the pagination buttons with the span class="page-list", but even that isn't working. Ideally, I would like to make the crawler keep clicking Next until it has scraped all products on all pages. How can I do this? I'd really appreciate any input.

```python
from bs4 import BeautifulSoup
```
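One common pattern for this is to follow the Next link until it disappears, rather than guessing page numbers. A sketch against canned pages (the `product` and `page-list` class names are assumptions based on the question; on the real site each page's HTML would come from `requests.get(url).text`):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the site: three paginated pages, each linking to the next.
pages = {
    '/list?page=1': '<div class="product">A</div><span class="page-list"><a href="/list?page=2">Next</a></span>',
    '/list?page=2': '<div class="product">B</div><span class="page-list"><a href="/list?page=3">Next</a></span>',
    '/list?page=3': '<div class="product">C</div><span class="page-list"></span>',
}

products = []
url = '/list?page=1'
while url:
    soup = BeautifulSoup(pages[url], 'html.parser')
    products.extend(div.get_text() for div in soup.find_all('div', class_='product'))
    # Follow the Next link inside the pagination span, if there is one.
    next_link = soup.find('span', class_='page-list').find('a')
    url = next_link['href'] if next_link else None

print(products)  # ['A', 'B', 'C']
```

The loop terminates naturally on the last page because no Next anchor is found there.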

Writing and saving a CSV file from scraped data using Python and BeautifulSoup4

Submitted by 懵懂的女人 on 2019-12-20 10:55:24
Question: I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer. I used Python and BeautifulSoup4 to extract my data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the script to …
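For the writing step, the stdlib `csv` module handles headers and quoting. A minimal sketch with one made-up row (the field names mirror the question; `io.StringIO` stands in for a real file opened with `open('courses.csv', 'w', newline='')`):

```python
import csv
import io

# Rows as they might come out of the scraper; the values here are invented.
courses = [
    {'Name': 'Pebble Creek GC', 'Address': '123 Fairway Dr', 'Ownership': 'Public',
     'Website': 'http://example.com', 'Phone': '555-0100'},
]

fieldnames = ['Name', 'Address', 'Ownership', 'Website', 'Phone']

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()       # one header row
writer.writerows(courses)  # one row per scraped course

print(buf.getvalue())
```

Appending each scraped page's rows to the same `DictWriter` gives one consolidated CSV for later geocoding.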

Extracting an element and inserting a space

Submitted by 為{幸葍}努か on 2019-12-20 10:31:36
Question: I'm parsing HTML using BeautifulSoup in Python, and I don't know how to insert a space when extracting a text element. This is the code:

```python
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
print soup.text
```

The output is thisisexample, but I want to insert a space so it reads this is example. How do I insert a space?

Answer 1: Use getText instead:

```python
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
print soup.getText(separator=u' ')
```
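In BeautifulSoup 4 (the `bs4` package, for Python 3) the same separator idea is spelled `get_text`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html>this<b>is</b>example</html>', 'html.parser')

# get_text takes a separator inserted between the text of adjacent elements.
text = soup.get_text(separator=' ')
print(text)  # this is example
```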

Get contents by class names using Beautiful Soup

Submitted by 陌路散爱 on 2019-12-20 09:56:21
Question: Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

```python
soup.class['feeditemcontent cxfeeditemcontent']
```

or:

```python
soup.find_all('class')
```

This is the HTML source:

```html
<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>
```

and this is the Python code:

```python
from BeautifulSoup import BeautifulSoup
html_doc = open('home
```
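With BeautifulSoup 4, the usual way is the `class_` keyword (since `class` is a reserved word in Python), or a CSS selector when both classes must be present. A sketch using the HTML from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembody"><span>The actual data is some where here</span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# class_ matches any element carrying this CSS class, even in a multi-class attribute.
div = soup.find('div', class_='feeditemcontent')
print(div.get_text(strip=True))

# A CSS selector requiring both classes finds the same node.
same_div = soup.select_one('div.feeditemcontent.cxfeeditemcontent')
```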

Difference between BeautifulSoup and Scrapy crawler?

Submitted by 给你一囗甜甜゛ on 2019-12-20 07:56:52
Question: I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup, but not so much with the Scrapy crawler.

Answer 1: Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling from, then you can specify constraints on how many URLs you want to crawl and fetch, and so on. It is a complete framework for web scraping or crawling. While BeautifulSoup is a …

Why does the loop repeat and not change the variable? [closed]

Submitted by 半城伤御伤魂 on 2019-12-20 07:51:40
Question: Closed. This question needs details or clarity and is not currently accepting answers. Closed 3 days ago.

```python
# import libraries
import requests
from bs4 import BeautifulSoup

links = set()

# "skeleton" of the URL
base_url = 'https://steamcommunity.com/market/search?appid=730&q=#p{}'

# the site has 1300 pages, and I want to parse all of them
count = 1301
for i in range(count):
    url = base_url.format(i)
    # send GET request
```
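Independent of the loop logic, one likely reason every iteration returns the same results: everything after `#` in a URL is a fragment, which the browser keeps to itself and an HTTP client never sends to the server, so each request fetches page 1. The stdlib URL parser shows where the page number ends up:

```python
from urllib.parse import urlsplit

url = 'https://steamcommunity.com/market/search?appid=730&q=#p42'
parts = urlsplit(url)

# The fragment lives only client-side; an HTTP request carries just path + query.
print(parts.query)     # appid=730&q=
print(parts.fragment)  # p42
```

To actually get different pages, the page number has to travel in the query string, using whatever parameter the site really supports (not shown here, as the question does not say).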

BeautifulSoup: find a tag with an attribute that has no value?

Submitted by 只谈情不闲聊 on 2019-12-20 07:30:04
Question: I'm trying to get the content of a particular tag which has an attribute but no value. How can I get it? For example:

```python
cont = '<nav></nav> <nav breadcrumbs> <a href="">aa</a></nav> <nav></nav>'
```

From the above I want to extract the `<nav breadcrumbs> <a href="">aa</a></nav>`. So I have tried the following:

```python
bread = contSoup.find("nav", {"breadcrumbs": ""})
```

I have also tried this:

```python
bread = contSoup.find("nav breadcrumbs")
```

Finally I'm using a regex to get this data; I'm able to get the …
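Assuming BeautifulSoup 4, passing `True` as the attribute value matches tags that merely have the attribute, which covers valueless attributes like `breadcrumbs` (the HTML parser stores them as empty strings):

```python
from bs4 import BeautifulSoup

cont = '<nav></nav> <nav breadcrumbs> <a href="">aa</a></nav> <nav></nav>'
soup = BeautifulSoup(cont, 'html.parser')

# attrs={'breadcrumbs': True} means "has this attribute", whatever its value.
bread = soup.find('nav', attrs={'breadcrumbs': True})
print(bread.a.get_text())  # aa
```

No regex needed: `bread` is the second `<nav>` element, and its children can be navigated as usual.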

BeautifulSoup fails to parse long view state

Submitted by 房东的猫 on 2019-12-20 06:38:52
Question: I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0. If I print out the resulting soup, it ends like this:

kZXI9IjAi"/></form></body></html>

Searching for the last characters 9IjAi in the raw HTML, I found that they are in the middle of a huge viewstate. BeautifulSoup seems to have a problem with this. Any hint what I might be doing wrong, or how to parse such a page?

Answer 1: BeautifulSoup uses a pluggable HTML parser to build the 'soup';
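Regarding the pluggable-parser point: the backend is chosen by the second argument to the `BeautifulSoup` constructor, and switching it is the usual fix when one parser chokes on a page (for example, on an enormous `__VIEWSTATE` attribute). A sketch with a tiny stand-in viewstate:

```python
from bs4 import BeautifulSoup

html = '<form><input type="hidden" name="__VIEWSTATE" value="kZXI9IjAi"/></form>'

# 'html.parser' is the stdlib backend; 'lxml' and 'html5lib' are alternative
# backends (if installed) that tolerate broken or very large markup differently.
soup = BeautifulSoup(html, 'html.parser')
viewstate = soup.find('input', attrs={'name': '__VIEWSTATE'})['value']
print(viewstate)  # kZXI9IjAi
```

Trying the same document under each installed backend quickly shows whether the truncation is a parser limitation rather than a bug in your code.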

Web crawler to get links from a news website

Submitted by 六眼飞鱼酱① on 2019-12-20 06:38:44
Question: I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python. main.py contains:

```python
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext
```

An example of the object in …
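A possible pitfall in the code above: `tag.contents[0]` may be a `Tag` rather than a string, which makes the `articletext += ...` concatenation misbehave or fail. A sketch that pulls the link targets explicitly (the HTML is a made-up stand-in for the archive page):

```python
from bs4 import BeautifulSoup

html = '''
<ul>
  <li data-section="Business"><a href="/business/story-1">Story 1</a></li>
  <li data-section="Sport"><a href="/sport/story-2">Story 2</a></li>
  <li data-section="Business"><a href="/business/story-3">Story 3</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Read the <a> child's href and text instead of relying on contents[0] being a string.
links = [li.a['href'] for li in soup.find_all('li', attrs={'data-section': 'Business'})]
print(links)  # ['/business/story-1', '/business/story-3']
```

The same `find_all` filter works on the real page once `html` is replaced by the fetched document.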