beautifulsoup

Download files using requests and BeautifulSoup

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-20 17:14:54
Question: I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:

```python
import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
```
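The loop above opens output files without ever downloading anything. A minimal sketch of the intended approach, run here against an inline HTML stand-in for the real directory listing (the filenames and base URL are illustrative assumptions):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical directory listing standing in for the real page.
html = '''
<html><body>
<a href="apostila-01.pdf">Apostila 1</a>
<a href="apostila-02.pdf">Apostila 2</a>
<a href="../">Parent directory</a>
</body></html>
'''

base_url = 'http://www.desconversa.com.br/matematica/wp-content/uploads/2013/01/'
soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors that point at PDFs, and resolve them against the page URL.
pdf_urls = [urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.pdf')]

print(pdf_urls)
```

On the real page you would then fetch each URL with `requests.get(url)` and write `response.content` to a file opened in `'wb'` mode, one file per link rather than one per loop index.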

Scraping multiple paginated links with BeautifulSoup and Requests

Submitted by 偶尔善良 on 2019-12-20 15:22:04
Question: Python beginner here. I'm trying to scrape all products from one category on dabs.com. I've managed to scrape all products on a given page, but I'm having trouble iterating over all the paginated links. Right now I've tried to isolate all the pagination buttons with the span class="page-list", but even that isn't working. Ideally, I would like to make the crawler keep clicking Next until it has scraped all products on all pages. How can I do this? I'd really appreciate any input.

```python
from bs4 import BeautifulSoup
```
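One common pattern for this is to follow the Next link until it disappears, rather than guessing page numbers. A sketch against canned pages (the `product` and `page-list` class names are assumptions based on the question; on the real site each page's HTML would come from `requests.get(url).text`):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the site: three paginated pages, each linking to the next.
pages = {
    '/list?page=1': '<div class="product">A</div><span class="page-list"><a href="/list?page=2">Next</a></span>',
    '/list?page=2': '<div class="product">B</div><span class="page-list"><a href="/list?page=3">Next</a></span>',
    '/list?page=3': '<div class="product">C</div><span class="page-list"></span>',
}

products = []
url = '/list?page=1'
while url:
    soup = BeautifulSoup(pages[url], 'html.parser')
    products.extend(div.get_text() for div in soup.find_all('div', class_='product'))
    # Follow the Next link inside the pagination span, if there is one.
    next_link = soup.find('span', class_='page-list').find('a')
    url = next_link['href'] if next_link else None

print(products)  # ['A', 'B', 'C']
```

The loop terminates naturally on the last page because no Next anchor is found there.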

Writing and saving a CSV file from scraped data using Python and BeautifulSoup4

Submitted by 懵懂的女人 on 2019-12-20 10:55:24
Question: I am trying to scrape data from the PGA.com website to get a table of all of the golf courses in the United States. In my CSV table I want to include the name of the golf course, address, ownership, website, and phone number. With this data I would like to geocode it, place it on a map, and keep a local copy on my computer. I used Python and BeautifulSoup4 to extract my data. I have gotten as far as extracting the data from the website, but I am having difficulty writing the script to …
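For the writing step, the stdlib `csv` module handles headers and quoting. A minimal sketch with one made-up row (the field names mirror the question; `io.StringIO` stands in for a real file opened with `open('courses.csv', 'w', newline='')`):

```python
import csv
import io

# Rows as they might come out of the scraper; the values here are invented.
courses = [
    {'Name': 'Pebble Creek GC', 'Address': '123 Fairway Dr', 'Ownership': 'Public',
     'Website': 'http://example.com', 'Phone': '555-0100'},
]

fieldnames = ['Name', 'Address', 'Ownership', 'Website', 'Phone']

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()       # one header row
writer.writerows(courses)  # one row per scraped course

print(buf.getvalue())
```

Appending each scraped page's rows to the same `DictWriter` gives one consolidated CSV for later geocoding.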

Extracting an element and inserting a space

Submitted by 為{幸葍}努か on 2019-12-20 10:31:36
Question: I'm parsing HTML using BeautifulSoup in Python, and I don't know how to insert a space when extracting a text element. This is the code:

```python
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
print soup.text
```

The output is thisisexample, but I want to insert a space so it reads this is example. How do I insert a space?

Answer 1: Use getText instead:

```python
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
print soup.getText(separator=u' ')
```
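In BeautifulSoup 4 (the `bs4` package, for Python 3) the same separator idea is spelled `get_text`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html>this<b>is</b>example</html>', 'html.parser')

# get_text takes a separator inserted between the text of adjacent elements.
text = soup.get_text(separator=' ')
print(text)  # this is example
```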

Get contents by class names using Beautiful Soup

Submitted by 陌路散爱 on 2019-12-20 09:56:21
Question: Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

```python
soup.class['feeditemcontent cxfeeditemcontent']
```

or:

```python
soup.find_all('class')
```

This is the HTML source:

```html
<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>
```

and this is the Python code:

```python
from BeautifulSoup import BeautifulSoup
html_doc = open('home
```
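With BeautifulSoup 4, the usual way is the `class_` keyword (since `class` is a reserved word in Python), or a CSS selector when both classes must be present. A sketch using the HTML from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembody"><span>The actual data is some where here</span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# class_ matches any element carrying this CSS class, even in a multi-class attribute.
div = soup.find('div', class_='feeditemcontent')
print(div.get_text(strip=True))

# A CSS selector requiring both classes finds the same node.
same_div = soup.select_one('div.feeditemcontent.cxfeeditemcontent')
```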

Difference between BeautifulSoup and Scrapy crawler?

Submitted by 给你一囗甜甜゛ on 2019-12-20 07:56:52
Question: I want to make a website that shows a comparison between Amazon and eBay product prices. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup, but not so much with the Scrapy crawler.

Answer 1: Scrapy is a web-spider or web-scraper framework. You give Scrapy a root URL to start crawling from, then you can specify constraints on how many URLs you want to crawl and fetch, and so on. It is a complete framework for web scraping or crawling. While BeautifulSoup is a …

Why does the loop repeat and not change the variable? [closed]

Submitted by 半城伤御伤魂 on 2019-12-20 07:51:40
Question: Closed. This question needs details or clarity and is not currently accepting answers. Closed 3 days ago.

```python
# import libraries
import requests
from bs4 import BeautifulSoup

links = set()

# "skeleton" of the URL
base_url = 'https://steamcommunity.com/market/search?appid=730&q=#p{}'

# the site has 1300 pages, and I want to parse all of them
count = 1301
for i in range(count):
    url = base_url.format(i)
    # send GET request
```
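Independent of the loop logic, one likely reason every iteration returns the same results: everything after `#` in a URL is a fragment, which the browser keeps to itself and an HTTP client never sends to the server, so each request fetches page 1. The stdlib URL parser shows where the page number ends up:

```python
from urllib.parse import urlsplit

url = 'https://steamcommunity.com/market/search?appid=730&q=#p42'
parts = urlsplit(url)

# The fragment lives only client-side; an HTTP request carries just path + query.
print(parts.query)     # appid=730&q=
print(parts.fragment)  # p42
```

To actually get different pages, the page number has to travel in the query string, using whatever parameter the site really supports (not shown here, as the question does not say).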

BeautifulSoup: find a tag with an attribute that has no value?

Submitted by 只谈情不闲聊 on 2019-12-20 07:30:04
Question: I'm trying to get the content of a particular tag which has an attribute but no value. How can I get it? For example:

```python
cont = '<nav></nav> <nav breadcrumbs> <a href="">aa</a></nav> <nav></nav>'
```

From the above I want to extract the `<nav breadcrumbs> <a href="">aa</a></nav>`. So I have tried the following:

```python
bread = contSoup.find("nav", {"breadcrumbs": ""})
```

I have also tried this:

```python
bread = contSoup.find("nav breadcrumbs")
```

Finally I'm using a regex to get this data; I'm able to get the …
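Assuming BeautifulSoup 4, passing `True` as the attribute value matches tags that merely have the attribute, which covers valueless attributes like `breadcrumbs` (the HTML parser stores them as empty strings):

```python
from bs4 import BeautifulSoup

cont = '<nav></nav> <nav breadcrumbs> <a href="">aa</a></nav> <nav></nav>'
soup = BeautifulSoup(cont, 'html.parser')

# attrs={'breadcrumbs': True} means "has this attribute", whatever its value.
bread = soup.find('nav', attrs={'breadcrumbs': True})
print(bread.a.get_text())  # aa
```

No regex needed: `bread` is the second `<nav>` element, and its children can be navigated as usual.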

BeautifulSoup fails to parse long view state

Submitted by 房东的猫 on 2019-12-20 06:38:52
Question: I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0. If I print out the resulting soup, it ends like this:

kZXI9IjAi"/></form></body></html>

Searching for the last characters 9IjAi in the raw HTML, I found that they are in the middle of a huge viewstate. BeautifulSoup seems to have a problem with this. Any hint what I might be doing wrong, or how to parse such a page?

Answer 1: BeautifulSoup uses a pluggable HTML parser to build the 'soup';
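Regarding the pluggable-parser point: the backend is chosen by the second argument to the `BeautifulSoup` constructor, and switching it is the usual fix when one parser chokes on a page (for example, on an enormous `__VIEWSTATE` attribute). A sketch with a tiny stand-in viewstate:

```python
from bs4 import BeautifulSoup

html = '<form><input type="hidden" name="__VIEWSTATE" value="kZXI9IjAi"/></form>'

# 'html.parser' is the stdlib backend; 'lxml' and 'html5lib' are alternative
# backends (if installed) that tolerate broken or very large markup differently.
soup = BeautifulSoup(html, 'html.parser')
viewstate = soup.find('input', attrs={'name': '__VIEWSTATE'})['value']
print(viewstate)  # kZXI9IjAi
```

Trying the same document under each installed backend quickly shows whether the truncation is a parser limitation rather than a bug in your code.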

Web crawler to get links from a news website

Submitted by 六眼飞鱼酱① on 2019-12-20 06:38:44
Question: I am trying to get the links from a news website page (from one of its archives). I wrote the following lines of code in Python. main.py contains:

```python
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]
print articletext
```

An example of the object in …
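A possible pitfall in the code above: `tag.contents[0]` may be a `Tag` rather than a string, which makes the `articletext += ...` concatenation misbehave or fail. A sketch that pulls the link targets explicitly (the HTML is a made-up stand-in for the archive page):

```python
from bs4 import BeautifulSoup

html = '''
<ul>
  <li data-section="Business"><a href="/business/story-1">Story 1</a></li>
  <li data-section="Sport"><a href="/sport/story-2">Story 2</a></li>
  <li data-section="Business"><a href="/business/story-3">Story 3</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Read the <a> child's href and text instead of relying on contents[0] being a string.
links = [li.a['href'] for li in soup.find_all('li', attrs={'data-section': 'Business'})]
print(links)  # ['/business/story-1', '/business/story-3']
```

The same `find_all` filter works on the real page once `html` is replaced by the fetched document.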