beautifulsoup

Extracting the text between two header tags using BeautifulSoup in Python

一笑奈何 submitted on 2021-02-18 18:55:47
Question: I am trying to extract the plot of a movie from its Wikipedia page in Python using BeautifulSoup. I am new to Python and BeautifulSoup, so I am not sure how to approach this. This is the input HTML: <h2><span class="mw-headline" id="Plot">Plot</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Moana_(2016_film)&action=edit&section=1" title="Edit section: Plot">edit</a><span class="mw-editsection-bracket">]</span></span></h2> <p
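A minimal sketch of one common approach: locate the headline span by its id, climb to the enclosing <h2>, then collect sibling <p> tags until the next <h2>. The HTML here is a cut-down stand-in for the Wikipedia markup shown in the question.

```python
from bs4 import BeautifulSoup

html = """
<h2><span class="mw-headline" id="Plot">Plot</span></h2>
<p>First plot paragraph.</p>
<p>Second plot paragraph.</p>
<h2><span class="mw-headline" id="Cast">Cast</span></h2>
<p>Not part of the plot.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the <h2> that contains the "Plot" headline span,
# then walk its following siblings until the next <h2>.
heading = soup.find("span", id="Plot").find_parent("h2")
paragraphs = []
for sibling in heading.find_next_siblings():
    if sibling.name == "h2":
        break
    if sibling.name == "p":
        paragraphs.append(sibling.get_text(strip=True))

plot = "\n".join(paragraphs)
print(plot)
```

On the live page the plot paragraphs sit between the "Plot" heading and the next section heading, so stopping at the next h2 bounds the extraction.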

Use soup.get_text() with UTF-8

旧时模样 submitted on 2021-02-18 11:40:35
Question: I need to get all the text from a page using BeautifulSoup. BeautifulSoup's documentation shows that you can call soup.get_text() to do this. When I tried it on reddit.com, I got this error: UnicodeEncodeError in soup.py:16 'cp932' codec can't encode character u'\xa0' in position 2262: illegal multibyte sequence. I get errors like that on most of the sites I checked. I got similar errors when I called soup.prettify() too, but I fixed those by changing it to soup.prettify('UTF-8').
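The error comes from printing the extracted text to a console whose codec (cp932 here) cannot represent characters such as the non-breaking space U+00A0; get_text() itself succeeds. A minimal sketch of encoding explicitly instead of relying on the console codec:

```python
from bs4 import BeautifulSoup

html = "<p>price:\xa0100</p>"  # \xa0 is the non-breaking space that cp932 rejects
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()       # a normal unicode string; no error at this point
data = text.encode("utf-8")  # encode explicitly instead of relying on the console codec
print(data.decode("utf-8"))
```

When saving to disk, opening the file with an explicit encoding (e.g. open("out.txt", "w", encoding="utf-8")) avoids the same class of error.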

BeautifulSoup.find_all() method not working with namespaced tags

我是研究僧i submitted on 2021-02-18 10:59:12
Question: I have encountered some very strange behaviour while working with BeautifulSoup today. Let's look at a very simple HTML snippet: <html><body><ix:nonfraction>lele</ix:nonfraction></body></html>. I am trying to get the content of the <ix:nonfraction> tag with BeautifulSoup. Everything works fine when using the find method:

    from bs4 import BeautifulSoup
    html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>"
    soup = BeautifulSoup(html, 'lxml')  # The parser used here does not
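A workaround sketch, assuming the parser keeps the literal namespaced name: instead of matching on the tag-name string, pass a function to find_all and compare the parsed tag name yourself. This sidesteps any special handling of the colon in the name.

```python
from bs4 import BeautifulSoup

html = "<html><body><ix:nonfraction>lele</ix:nonfraction></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Match the parsed tag name with a function rather than a name string,
# which is more robust for namespaced tags like <ix:nonfraction>.
tags = soup.find_all(lambda tag: tag.name == "ix:nonfraction")
print([t.get_text() for t in tags])
```

Different parsers (html.parser, lxml, lxml-xml) can normalize namespaced names differently, so checking tag.name on a parsed example first is worthwhile.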

Using Beautiful Soup with accents and different characters

最后都变了- submitted on 2021-02-18 07:44:11
Question: I'm using Beautiful Soup to pull medal winners from past Olympics. It's tripping over the accents in some of the event and athlete names. I've seen similar problems posted online, but I'm new to Python and am having trouble applying the fixes to my code. If I print my soup, the accents appear fine, but when I start parsing the soup (and writing it to a CSV file) the accented characters become garbled: 'Louis Perrée' becomes 'Louis Perr√©e'.

    from BeautifulSoup import BeautifulSoup
    import urllib2
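The 'Perr√©e' garbling is the classic signature of UTF-8 bytes being decoded with the wrong codec somewhere between parsing and writing. A minimal sketch of the fix on the output side: keep the value as a unicode string and write the CSV with an explicit UTF-8 encoding (shown here with an in-memory buffer; the athlete name is a stand-in for a parsed value).

```python
import csv
import io

# Simulated parsed value containing an accent; in the original post this
# came from a BeautifulSoup parse of the Olympics results page.
name = "Louis Perrée"

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["fencing", name])

# For a real file, pass the encoding explicitly instead of the platform default:
# open("medals.csv", "w", encoding="utf-8", newline="")
print(buffer.getvalue().strip())
```

The question's code uses BeautifulSoup 3 and urllib2 (Python 2); under Python 2 the equivalent fix is decoding the fetched page as UTF-8 before parsing and encoding each field as UTF-8 before writing.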

BeautifulSoup - scraping a forum page

丶灬走出姿态 submitted on 2021-02-17 09:04:47
Question: I'm trying to scrape a forum discussion and export it as a CSV file, with columns such as "thread title", "user", and "post", where the last is the actual forum post from each individual. I'm a complete beginner with Python and BeautifulSoup, so I'm having a really hard time with this! My current problem is that all the text is split into one character per row in the CSV file. Is there anyone out there who can help me? It would be fantastic if someone could give me a hand! Here's the code I
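The one-character-per-row symptom almost always means a bare string was passed to writer.writerow(), which iterates it character by character. A minimal sketch of the fix, with hypothetical field values standing in for the scraped data:

```python
import csv
import io

# Hypothetical scraped data standing in for (thread title, user, post) tuples.
rows = [("Thread title", "user42", "First post text")]

buffer = io.StringIO()
writer = csv.writer(buffer)
for title, user, post in rows:
    # writer.writerow(post) would treat the string as a sequence of characters,
    # producing one column per character; pass a list of fields instead.
    writer.writerow([title, user, post])

print(buffer.getvalue().strip())
```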

Python beautifulsoup extract value without identifier

蹲街弑〆低调 submitted on 2021-02-17 04:47:50
Question: I am facing a problem and don't know how to solve it properly. I want to extract the price (so in the first example 130€, in the second 130€). The problem is that the attributes change all the time, so I am unable to do something like the following, because I am scraping hundreds of sites and on each site the first 2 characters of the "id" attribute may differ:

    tag = soup_expose_html.find('span', attrs={'id': re.compile(r'(07_content$)')})

Even if I used something like this it won't work,
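One sketch, assuming the id always ends in a stable suffix while only the leading characters vary: anchor the regular expression on that suffix so the changing prefix does not matter. The id value here is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = '<span id="xy_content">130€</span>'  # "xy" stands in for the varying prefix
soup = BeautifulSoup(html, "html.parser")

# Anchor the pattern on the stable "_content" suffix of the id
# and let the changing leading characters match implicitly.
tag = soup.find("span", id=re.compile(r"_content$"))
print(tag.get_text())
```

If even the suffix varies across sites, matching on surrounding structure (e.g. a parent element or the € sign in the text) is a fallback.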

Scraping beauty photos with scrapy

此生再无相见时 submitted on 2021-02-15 13:26:40
Use scrapy to crawl the image data of an entire site, launching it with CrawlerProcess.

    # -*- coding: utf-8 -*-
    import scrapy
    import requests
    from bs4 import BeautifulSoup

    from meinr.items import MeinrItem


    class Meinr1Spider(scrapy.Spider):
        name = 'meinr1'
        # allowed_domains = ['www.baidu.com']
        # start_urls = ['http://m.tupianzj.com/meinv/xiezhen/']
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        def num(self, url, headers):  # get the page count and URL format of each category
            html = requests.get(url=url, headers=headers)
            if html

Scraping girl photos

谁说胖子不能爱 submitted on 2021-02-15 11:08:24
Without further ado, straight to the code. Not original; I can't remember the original author.

    import requests
    from bs4 import BeautifulSoup

    def imgurl(url):
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')
        # get the total number of pages
        page = int(soup.select('.pagenavi span')[-2].text)
        # get the image link
        a = soup.select('.main-image a')[0]
        src = a.select('img')[0].get('src')
        meiziid = src[-9:-6]
        print('开始下载妹子:', format(meiziid))
        for i in range(1, page + 1):
            i = '%02d' % i
            img = src.replace('01.jpg', str(i) + '.jpg')
            # this request header defeats the hotlink protection
            headers = {
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
                'Referer': 'http://www.mzitu.com'
            }
            response = requests.get(img, headers

Reading value from HTML page - nseindia

妖精的绣舞 submitted on 2021-02-15 07:52:27
Question: I want to read the "Open", "High" and "Close" values of NIFTY 50 from the page below: https://www1.nseindia.com/live_market/dynaContent/live_watch/live_index_watch.htm. The code below worked before; it looks like there has been some change in the webpage, and I am no longer able to read the values, as I get the error below:

    nifty_50_row = table.find_all('tr')[2]  # get first row of prices
    AttributeError: 'NoneType' object has no attribute 'find_all'

I need your help to fix this issue. My code is as below: url
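The AttributeError means the earlier find() for the table returned None, typically because the page layout changed, the table is now rendered by JavaScript, or the site blocks requests without browser-like headers. A sketch of the parsing step with a guard, using a hypothetical stand-in table since the live markup may differ:

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the index-watch table; the real page's id and
# column order may differ, and fetching it may require browser-like headers.
html = """
<table id="liveIndexWatch">
  <tr><th>Index</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
  <tr><td>NIFTY 50</td><td>14500.0</td><td>14620.5</td><td>14450.0</td><td>14600.1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="liveIndexWatch")
if table is None:
    # Guard against the NoneType error: fail with a clear message instead.
    raise SystemExit("table not found - the page layout has probably changed")

cells = [td.get_text() for td in table.find_all("tr")[1].find_all("td")]
open_, high, close = cells[1], cells[2], cells[4]
print(open_, high, close)
```

Inspecting the live page's current HTML (or its XHR/JSON endpoints) is the way to find the selector that replaced the old one.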

Web scraping an “onclick” object table on a website with python

随声附和 submitted on 2021-02-15 07:44:51
Question: I am trying to scrape the data from this link: page. If you click the up arrow you will notice the highlighted days in the month sections. Clicking on a highlighted day, a table with the tenders initiated that day appears. All I need to do is get the data in each table for each highlighted day in the calendar. There might be one or more tenders (up to a maximum of 7) per day. (Screenshot: the table appears on click.) I have done some web scraping with bs4, but I think this is a job for Selenium (please,
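For click-driven content like this, the usual split is: Selenium clicks each highlighted day and hands over driver.page_source, and BeautifulSoup parses the table that appeared. A sketch of the parsing half, with a hypothetical stand-in for the tender table (the real class names and columns will differ):

```python
from bs4 import BeautifulSoup

# In the real workflow, Selenium would click a highlighted calendar day
# (element.click()) and supply driver.page_source; this string stands in
# for the table that appears after the click.
page_source = """
<table class="tenders">
  <tr><th>Tender</th><th>Deadline</th></tr>
  <tr><td>Road maintenance</td><td>2021-03-01</td></tr>
  <tr><td>School supplies</td><td>2021-03-05</td></tr>
</table>
"""

soup = BeautifulSoup(page_source, "html.parser")
rows = []
for tr in soup.select("table.tenders tr")[1:]:  # skip the header row
    rows.append([td.get_text() for td in tr.find_all("td")])
print(rows)
```

Looping this over every highlighted day element Selenium finds, and appending each day's rows to one list, yields the data for the whole calendar.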