beautifulsoup

Extracting the text between two header tags using BeautifulSoup in Python

一笑奈何 submitted on 2021-02-18 18:55:47
Question: I am trying to extract the plot of a movie from its Wikipedia page in Python using BeautifulSoup. I am new to Python and BeautifulSoup, so I am not sure how to approach this. This is the input HTML: <h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&action=edit&section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2> <p
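A minimal sketch of one common approach: locate the headline span by its id, climb to the enclosing <h2>, then collect sibling <p> tags until the next <h2>. The HTML here is a cut-down stand-in for the Wikipedia markup shown in the question.

```python
from bs4 import BeautifulSoup

html = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>First plot paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Not part of the plot.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <h2> that contains the "Plot" headline span,
# then walk its following siblings until the next <h2>.
heading = soup.find("span", id="Plot").find_parent("h2")
paragraphs = []
for sibling in heading.find_next_siblings():
    if sibling.name == "h2":
        break
    if sibling.name == "p":
        paragraphs.append(sibling.get_text(strip=True))

plot = "\n".join(paragraphs)
print(plot)
```

On the live page the plot paragraphs sit between the "Plot" heading and the next section heading, so stopping at the next h2 bounds the extraction.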

Use soup.get_text() with UTF-8

旧时模样 submitted on 2021-02-18 11:40:35
Question: I need to get all the text from a page using BeautifulSoup. BeautifulSoup's documentation shows that you can call soup.get_text() to do this. When I tried it on reddit.com, I got this error: UnicodeEncodeError in soup.py:16 'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence. I get errors like that on most of the sites I checked. I got similar errors when I called soup.prettify() too, but I fixed those by changing it to soup.prettify('UTF-8').
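The error comes from printing the extracted text to a console whose codec (cp932 here) cannot represent characters such as the non-breaking space U+00A0; get_text() itself succeeds. A minimal sketch of encoding explicitly instead of relying on the console codec:

```python
from bs4 import BeautifulSoup

html = "<p>price:\xa0100</p>"  # \xa0 is the non-breaking space that cp932 rejects
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()       # a normal unicode string; no error at this point
data = text.encode("utf-8")  # encode explicitly instead of relying on the console codec
print(data.decode("utf-8"))
```

When saving to disk, opening the file with an explicit encoding (e.g. open("out.txt", "w", encoding="utf-8")) avoids the same class of error.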

BeautifulSoup.find_all() method not working with namespaced tags

我是研究僧i submitted on 2021-02-18 10:59:12
Question: I have encountered some very strange behaviour while working with BeautifulSoup today. Let's look at a very simple HTML snippet: <html><body><ix:nonfraction>lele</ix:nonfraction></body></html>. I am trying to get the content of the <ix:nonfraction> tag with BeautifulSoup. Everything works fine when using the find method:

    from bs4 import BeautifulSoup
    html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>"
    soup = BeautifulSoup(html, 'lxml')  # The parser used here does not
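A workaround sketch, assuming the parser keeps the literal namespaced name: instead of matching on the tag-name string, pass a function to find_all and compare the parsed tag name yourself. This sidesteps any special handling of the colon in the name.

```python
from bs4 import BeautifulSoup

html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Match the parsed tag name with a function rather than a name string,
# which is more robust for namespaced tags like <ix:nonfraction>.
tags = soup.find_all(lambda tag: tag.name == "ix:nonfraction")
print([t.get_text() for t in tags])
```

Different parsers (html.parser, lxml, lxml-xml) can normalize namespaced names differently, so checking tag.name on a parsed example first is worthwhile.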

Using Beautiful Soup with accents and different characters

最后都变了- submitted on 2021-02-18 07:44:11
Question: I'm using Beautiful Soup to pull medal winners from past Olympics. It's tripping over the accents in some of the event and athlete names. I've seen similar problems posted online, but I'm new to Python and am having trouble applying the fixes to my code. If I print my soup, the accents appear fine, but when I start parsing the soup (and writing it to a CSV file) the accented characters become garbled: 'Louis Perrée' becomes 'Louis Perr√©e'.

    from BeautifulSoup import BeautifulSoup
    import urllib2
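The 'Perr√©e' garbling is the classic signature of UTF-8 bytes being decoded with the wrong codec somewhere between parsing and writing. A minimal sketch of the fix on the output side: keep the value as a unicode string and write the CSV with an explicit UTF-8 encoding (shown here with an in-memory buffer; the athlete name is a stand-in for a parsed value).

```python
import csv
import io

# Simulated parsed value containing an accent; in the original post this
# came from a BeautifulSoup parse of the Olympics results page.
name = "Louis Perrée"

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["fencing", name])

# For a real file, pass the encoding explicitly instead of the platform default:
# open("medals.csv", "w", encoding="utf-8", newline="")
print(buffer.getvalue().strip())
```

The question's code uses BeautifulSoup 3 and urllib2 (Python 2); under Python 2 the equivalent fix is decoding the fetched page as UTF-8 before parsing and encoding each field as UTF-8 before writing.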

BeautifulSoup - scraping a forum page

丶灬走出姿态 submitted on 2021-02-17 09:04:47
Question: I'm trying to scrape a forum discussion and export it as a CSV file, with columns such as "thread title", "user", and "post", where the last is the actual forum post from each individual. I'm a complete beginner with Python and BeautifulSoup, so I'm having a really hard time with this! My current problem is that all the text is split into one character per row in the CSV file. Is there anyone out there who can help me? It would be fantastic if someone could give me a hand! Here's the code I
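The one-character-per-row symptom almost always means a bare string was passed to writer.writerow(), which iterates it character by character. A minimal sketch of the fix, with hypothetical field values standing in for the scraped data:

```python
import csv
import io

# Hypothetical scraped data standing in for (thread title, user, post) tuples.
rows = [("Thread title", "user42", "First post text")]

buffer = io.StringIO()
writer = csv.writer(buffer)
for title, user, post in rows:
    # writer.writerow(post) would treat the string as a sequence of characters,
    # producing one column per character; pass a list of fields instead.
    writer.writerow([title, user, post])

print(buffer.getvalue().strip())
```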

Python beautifulsoup extract value without identifier

蹲街弑〆低调 submitted on 2021-02-17 04:47:50
Question: I am facing a problem and don't know how to solve it properly. I want to extract the price (so in the first example 130€, in the second 130€). The problem is that the attributes change all the time, so I am unable to do something like the following, because I am scraping hundreds of sites and on each site the first 2 characters of the "id" attribute may differ:

    tag = soup_expose_html.find('span', attrs={'id': re.compile(r'(07_content$)')})

Even if I used something like this it won't work,
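One sketch, assuming the id always ends in a stable suffix while only the leading characters vary: anchor the regular expression on that suffix so the changing prefix does not matter. The id value here is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = '<span id="xy_content">130€</span>'  # "xy" stands in for the varying prefix
soup = BeautifulSoup(html, "html.parser")

# Anchor the pattern on the stable "_content" suffix of the id
# and let the changing leading characters match implicitly.
tag = soup.find("span", id=re.compile(r"_content$"))
print(tag.get_text())
```

If even the suffix varies across sites, matching on surrounding structure (e.g. a parent element or the € sign in the text) is a fallback.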

Scraping beauty photos with scrapy

此生再无相见时 submitted on 2021-02-15 13:26:40
Use scrapy to crawl the image data of an entire site, launching it with CrawlerProcess.

    # -*- coding: utf-8 -*-
    import scrapy
    import requests
    from bs4 import BeautifulSoup

    from meinr.items import MeinrItem


    class Meinr1Spider(scrapy.Spider):
        name = 'meinr1'
        # allowed_domains = ['www.baidu.com']
        # start_urls = ['http://m.tupianzj.com/meinv/xiezhen/']
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        def num(self, url, headers):  # get the page count and URL format of each category
            html = requests.get(url=url, headers=headers)
            if html

Scraping girl photos

谁说胖子不能爱 submitted on 2021-02-15 11:08:24
Without further ado, straight to the code. Not original; I can't remember the original author.

    import requests
    from bs4 import BeautifulSoup

    def imgurl(url):
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')
        # get the total number of pages
        page = int(soup.select('.pagenavi span')[-2].text)
        # get the image link
        a = soup.select('.main-image a')[0]
        src = a.select('img')[0].get('src')
        meiziid = src[-9:-6]
        print('开始下载妹子:', format(meiziid))
        for i in range(1, page + 1):
            i = '%02d' % i
            img = src.replace('01.jpg', str(i) + '.jpg')
            # this request header defeats the hotlink protection
            headers = {
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
                'Referer': 'http://www.mzitu.com'
            }
            response = requests.get(img, headers

Reading value from HTML page - nseindia

妖精的绣舞 submitted on 2021-02-15 07:52:27
Question: I want to read the "Open", "High" and "Close" values of NIFTY 50 from the page below: https://www1.nseindia.com/live_market/dynaContent/live_watch/live_index_watch.htm. The code below worked before; it looks like there has been some change in the webpage, and I am no longer able to read the values, as I get the error below:

    nifty_50_row = table.find_all('tr')[2]  # get first row of prices
    AttributeError: 'NoneType' object has no attribute 'find_all'

I need your help to fix this issue. My code is as below: url
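The AttributeError means the earlier find() for the table returned None, typically because the page layout changed, the table is now rendered by JavaScript, or the site blocks requests without browser-like headers. A sketch of the parsing step with a guard, using a hypothetical stand-in table since the live markup may differ:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the index-watch table; the real page's id and
# column order may differ, and fetching it may require browser-like headers.
html = """
<table id="liveIndexWatch">
  <tr><th>Index</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
  <tr><td>NIFTY 50</td><td>14500.0</td><td>14620.5</td><td>14450.0</td><td>14600.1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="liveIndexWatch")
if table is None:
    # Guard against the NoneType error: fail with a clear message instead.
    raise SystemExit("table not found - the page layout has probably changed")

cells = [td.get_text() for td in table.find_all("tr")[1].find_all("td")]
open_, high, close = cells[1], cells[2], cells[4]
print(open_, high, close)
```

Inspecting the live page's current HTML (or its XHR/JSON endpoints) is the way to find the selector that replaced the old one.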

Web scraping an “onclick” object table on a website with python

随声附和 submitted on 2021-02-15 07:44:51
Question: I am trying to scrape the data from this link: page. If you click the up arrow you will notice the highlighted days in the month sections. Clicking on a highlighted day, a table with the tenders initiated that day appears. All I need to do is get the data in each table for each highlighted day in the calendar. There might be one or more tenders (up to a maximum of 7) per day. (Screenshot: the table appears on click.) I have done some web scraping with bs4, but I think this is a job for Selenium (please,
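For click-driven content like this, the usual split is: Selenium clicks each highlighted day and hands over driver.page_source, and BeautifulSoup parses the table that appeared. A sketch of the parsing half, with a hypothetical stand-in for the tender table (the real class names and columns will differ):

```python
from bs4 import BeautifulSoup

# In the real workflow, Selenium would click a highlighted calendar day
# (element.click()) and supply driver.page_source; this string stands in
# for the table that appears after the click.
page_source = """
<table class="tenders">
  <tr><th>Tender</th><th>Deadline</th></tr>
  <tr><td>Road maintenance</td><td>2021-03-01</td></tr>
  <tr><td>School supplies</td><td>2021-03-05</td></tr>
</table>
"""

soup = BeautifulSoup(page_source, "html.parser")
rows = []
for tr in soup.select("table.tenders tr")[1:]:  # skip the header row
    rows.append([td.get_text() for td in tr.find_all("td")])
print(rows)
```

Looping this over every highlighted day element Selenium finds, and appending each day's rows to one list, yields the data for the whole calendar.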