beautifulsoup

How to extract the strong elements which are in div tag

南笙酒味 submitted on 2020-01-24 11:33:33
Question: I am new to web scraping and am using Python to scrape the data. Can someone help me extract the data from this element?
    <div class="dept"><strong>LENGTH:</strong> 15 credits</div>
My output should be:
    LENGTH: 15 credits
Here is my code:
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    length = bsObj.findAll("strong")
    for leng in length:
        print(leng.text, leng.next_sibling)
Output:
    DELIVERY: Campus
    LENGTH: 2 years
    OFFERED BY: Olin Business School
but I would like to have only LENGTH.
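One possible way to keep only the LENGTH entry is to compare each strong tag's text with the wanted label before printing. The sketch below is not from the original thread: the sample HTML and the helper name find_labelled_value are assumptions made for illustration, and it uses the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the snippet in the question.
html = """
<div class="dept"><strong>DELIVERY:</strong> Campus</div>
<div class="dept"><strong>LENGTH:</strong> 15 credits</div>
<div class="dept"><strong>OFFERED BY:</strong> Olin Business School</div>
"""

soup = BeautifulSoup(html, "html.parser")


def find_labelled_value(soup, label):
    """Return 'LABEL value' for the first <strong> whose text equals label, else None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    for strong in soup.find_all("strong"):
        if strong.get_text(strip=True) == label:
            sibling = strong.next_sibling  # the text node that follows </strong>
            value = str(sibling).strip() if sibling is not None else ""
            return f"{label} {value}".strip()
    return None


result = find_labelled_value(soup, "LENGTH:")
print(result)
```

Matching on the label text rather than on position should keep working if the page reorders its fields, although it still assumes the value directly follows the strong tag.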

Webpage values are missing while scraping data using BeautifulSoup python 3.6

删除回忆录丶 submitted on 2020-01-24 09:45:31
Question: I am using the script below to scrape the "STOCK QUOTE" data from http://fortune.com/fortune500/xcel-energy/, but it returns blank values. I have also tried the Selenium driver, with the same result. Please help.
    import requests
    from bs4 import BeautifulSoup as bs
    import pandas as pd
    r = requests.get('http://fortune.com/fortune500/xcel-energy/')
    soup = bs(r.content, 'lxml')  # tried: 'html.parser
    data = pd.DataFrame(columns=['C1','C2','C3','C4'], dtype='object', index=range(0,11))
    for table in soup.find_all('div

BeautifulSoup difference between findAll and findChildren

随声附和 submitted on 2020-01-24 09:40:08
Question: What is the difference? Don't they do the same thing, that is, find the inner tags with the given properties?
Answer 1: findChildren returns a ResultSet just as find_all does. There is no difference between the two methods, because findChildren is actually find_all. If you look at the source you can see:
    findChildren = find_all  # BS2
It is there for backwards compatibility, as is:
    findAll = find_all  # BS3
Source: https://stackoverflow.com/questions/38838460/beautifulsoup-difference-between-findall-and
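The equivalence described in the answer can be checked with a small experiment. This is only a sketch on assumed sample markup. In newer BeautifulSoup releases the legacy names may emit a deprecation warning, but they are expected to return the same tags.

```python
from bs4 import BeautifulSoup

# Assumed sample markup for the comparison.
soup = BeautifulSoup("<div><p>a</p><p>b</p></div>", "html.parser")
div = soup.div

via_find_all = div.find_all("p")           # current spelling
via_find_children = div.findChildren("p")  # BS2-era alias
via_findAll = div.findAll("p")             # BS3-era alias

# All three calls should yield the same list of <p> tags.
same_results = via_find_all == via_find_children == via_findAll
print(same_results, len(via_find_all))
```

For new code, find_all is the spelling to prefer. The older names exist only so that legacy scripts keep running.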

How to web-scrape multiple page with Selenium (Python)

一笑奈何 submitted on 2020-01-24 09:08:05
Question: I've seen several solutions for scraping multiple pages from a website, but couldn't make them work with my code. At the moment I have this code, which works for scraping the first page. I would like to create a loop to scrape all the pages of the website (from page 1 to 5).
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup
    options = Options()
    options.add_argument("window-size=1400,600")
    from fake_useragent

Can't remove line breaks from BeautifulSoup text output (Python 2.7.5)

不羁的心 submitted on 2020-01-24 03:46:10
Question: I'm trying to write a program that parses a series of HTML files and stores the resulting data in a .csv spreadsheet, which relies heavily on newlines being in exactly the right place. I've tried every method I can find to strip the line breaks from certain pieces of text, to no avail. The relevant code looks like this:
    soup = BeautifulSoup(f)
    ID = soup.td.get_text()
    ID.strip()
    ID.rstrip()
    ID.replace("\t", "").replace("\r", "").replace("\n", "")
    dateCreated = soup.td.find_next("td")
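A likely cause, judging only from the snippet shown, is that Python strings are immutable: strip(), rstrip() and replace() return new strings and leave the original untouched, so their results have to be assigned. The sketch below uses made-up sample values rather than the asker's files.

```python
# Assumed sample value standing in for the text returned by get_text().
raw = "\n\t 12345 \r\n"

raw.strip()      # the stripped copy is discarded here
still_raw = raw  # raw itself has not changed

# Assigning the result keeps the cleaned value.
cleaned = raw.strip()

# Collapsing every run of whitespace (including embedded newlines) to one space.
collapsed = " ".join("first\n  second\tthird\r\n".split())

print(repr(still_raw), repr(cleaned), repr(collapsed))
```

The split-and-join idiom is handy for CSV output because it also removes line breaks in the middle of a value, not only at the ends.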

What would be the best way to extract square meters from a string that also mentions the amount of bedrooms?

跟風遠走 submitted on 2020-01-24 00:23:10
Question: I'm trying to extract:
    <div class="xl-surface-ch">  84 m²    2 bed. </div>
from the link. The problem is that I only need the "84" in this string (the number sometimes has more than 2 or 3 digits as well). An added difficulty is that the square meters are sometimes not mentioned, which looks like this:
    <div class="xl-surface-ch">    2 bed. </div>
In that case I'd need to return 0. My best attempt is:
    sqm = []
    for item in soup.findAll('div', attrs={'class': 'xl-surface-ch'}):
        item = item.contents[0].strip()[0:4]
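One way to avoid fixed-width slicing is a regular expression that captures the digits immediately before "m²" and falls back to 0 when no match is found. This is a sketch under the assumption that the surface is always written as digits followed by "m²". The name extract_sqm is hypothetical.

```python
import re

# Digits followed by optional whitespace and the "m²" unit.
SQM_PATTERN = re.compile(r"(\d+)\s*m²")


def extract_sqm(text):
    """Return the square meters found in text as an int, or 0 if none are mentioned."""
    match = SQM_PATTERN.search(text)
    return int(match.group(1)) if match else 0


with_surface = extract_sqm("  84 m²    2 bed. ")
without_surface = extract_sqm("    2 bed. ")
large_surface = extract_sqm("1250 m² 4 bed.")
print(with_surface, without_surface, large_surface)
```

In the scraping loop, the function could be applied to item.get_text() for each matching div, so that the bedroom count is never mistaken for the surface.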

Using python/BeautifulSoup to replace HTML tag pair with a different one

China☆狼群 submitted on 2020-01-23 23:10:53
Question: I need to replace a matching pair of HTML tags with a different tag. BeautifulSoup (4) is probably suitable for the task, but I've never used it before and haven't found a suitable example anywhere. Can someone give me a hint? For example, this HTML code:
    <font color="red">this text is red</font>
should be changed to this:
    <span style="color: red;">this text is red</span>
The opening and closing HTML tags may not be on the same line.
Answer 1: Use replace_with() to replace elements. Adapting the
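Since the answer is cut off here, the following is only a sketch of how replace_with() might be applied to this case, not the original answerer's code. It builds a span with new_tag(), moves the children of each font tag into it, and swaps the two. The sample HTML deliberately spans two lines.

```python
from bs4 import BeautifulSoup

# Assumed sample input; the text is split across two lines on purpose.
html = '<p><font color="red">this text\nis red</font></p>'
soup = BeautifulSoup(html, "html.parser")

for font in soup.find_all("font", color=True):
    # Carry the old color attribute over into an inline style.
    span = soup.new_tag("span", style=f"color: {font['color']};")
    # Move the children across so nested markup and line breaks are preserved.
    for child in list(font.contents):
        span.append(child)
    font.replace_with(span)

result = str(soup)
print(result)
```

Because the parser works on the document tree rather than on lines of text, it does not matter whether the opening and closing tags sit on the same line.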

Get a list of tags and get the attribute values in BeautifulSoup

我们两清 submitted on 2020-01-23 21:02:09
Question: I'm attempting to use BeautifulSoup to get a list of HTML <div> tags, then check whether they have a name attribute and return that attribute's value. Please see my code:
    soup = BeautifulSoup(html)  # assume html contains <div> tags with a name attribute
    nameTags = soup.findAll('name')
    for n in nameTags:
        if n.has_key('name'):
            # get the value of the name attribute
My question is: how do I get the value of the name attribute?
Answer 1: Use the following code, it should work:
    nameTags = soup.findAll('div',{
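The answer's code is truncated, so the following is just one way this could be done with BeautifulSoup 4, shown on assumed sample markup. Attribute values are read with dictionary-style access, and the filter is passed through attrs because the keyword name is already taken by the tag-name parameter of find_all.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: two divs carry a name attribute, one does not.
html = '<div name="first">a</div><div>b</div><div name="second">c</div>'
soup = BeautifulSoup(html, "html.parser")

# Select only the <div> tags that have a name attribute, then read its value.
names = [div["name"] for div in soup.find_all("div", attrs={"name": True})]

# tag.get() returns None instead of raising KeyError when the attribute is absent.
missing = soup.find_all("div")[1].get("name")

print(names, missing)
```

In BeautifulSoup 4, has_attr('name') takes the place of the older has_key('name') check used in the question.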