beautifulsoup

Is this site not suited for web scraping using beautifulsoup?

Submitted by 醉酒当歌 on 2021-01-29 07:17:47
Question: I am trying to use BeautifulSoup to get the odds for each match on the following site: https://danskespil.dk/oddset/sports/category/990/counter-strike-go/matches. The goal is to end up with a text file containing lines of the form:

    Match1, Team1, odds for Team1 winning, Team2, odds for Team2 winning
    Match2, Team1, odds for Team1 winning, Team2, odds for Team2 winning

and so on. I am new to BeautifulSoup, so things already go wrong at a very elementary level. My approach is to "walk" through
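A minimal sketch of the intended output format, with two caveats: the odds on danskespil.dk may well be rendered by JavaScript, in which case requests plus BeautifulSoup will not see them in the static HTML at all; and the "match", "team", and "odds" class names below are hypothetical placeholders, not the site's real markup.

```python
from bs4 import BeautifulSoup

def scrape_odds(html):
    """Collect 'Team1, odds1, Team2, odds2' lines from hypothetical markup."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for match in soup.find_all("div", class_="match"):  # hypothetical class
        teams = [t.get_text(strip=True) for t in match.find_all("span", class_="team")]
        odds = [o.get_text(strip=True) for o in match.find_all("span", class_="odds")]
        # interleave teams with their odds: Team1, odds1, Team2, odds2
        lines.append(", ".join(x for pair in zip(teams, odds) for x in pair))
    return lines
```

If the static HTML turns out to be empty of odds, the usual alternatives are driving a browser (Selenium/Playwright) or finding the JSON endpoint the page itself calls.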

How to extract text from span surrounded by div using beautifulsoup

Submitted by 混江龙づ霸主 on 2021-01-29 07:11:55
Question: I have an HTML snippet as below:

    <div class="single_baby_name_description">
    <label>Meaning :</label> <span class="28816-meaning">the meaning of this name is universal whole.</span> </br>
    <label>Gender :</label> <span class="28816-gender">Girl</span> </br>
    <label>Religion :</label> <span class="28816-religion">Christianity</span> </br>
    <label>Origin :</label> <span class="28816-origin">German,French,Swedish</span> </br>
    </div>

I attempt to extract the text from all span elements inside the div using soup =
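One likely stumbling block here: the span class names start with a digit (`28816-meaning`), and a CSS identifier cannot begin with an unescaped digit, so selectors like `.28816-meaning` fail. A sketch that sidesteps selectors entirely by locating the div and iterating its spans:

```python
from bs4 import BeautifulSoup

html = """<div class="single_baby_name_description">
<label>Meaning :</label> <span class="28816-meaning">the meaning of this name is universal whole.</span>
<label>Gender :</label> <span class="28816-gender">Girl</span>
<label>Religion :</label> <span class="28816-religion">Christianity</span>
<label>Origin :</label> <span class="28816-origin">German,French,Swedish</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# find_all has no trouble with digit-leading class names
div = soup.find("div", class_="single_baby_name_description")
texts = [span.get_text(strip=True) for span in div.find_all("span")]
```

Pairing each label with its span (`zip(div.find_all("label"), div.find_all("span"))`) gives a field-name-to-value mapping if that is the end goal.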

BeautifulSoup not extracting div properly

Submitted by 安稳与你 on 2021-01-29 06:44:11
Question: BeautifulSoup is not extracting the div I want properly, and I am not sure what I am doing wrong. Here is the HTML:

    <div id='display'>
      <div class='result'>
        <div>text0 </p></div>
        <div>text1</div>
        <div>text2</div>
      </div>
    </div>

And here is my code:

    div = soup.find("div", {"class": "result"})
    print(div)

I am seeing this:

    <div class="result"> <div>text0 </div></div>

What I am expecting is this:

    <div class="result"> <div>text0</div> <div>text1</div> <div>text2</div> </div>

This works as expected if I
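The culprit is almost certainly the stray `</p>` in invalid HTML: different tree builders recover from it differently, and some close the enclosing divs early, truncating `div.result` after `text0`. A sketch showing that, with a recent BeautifulSoup, the `html.parser` builder simply drops the unmatched end tag and keeps all three children:

```python
from bs4 import BeautifulSoup

html = """<div id='display'>
  <div class='result'>
    <div>text0 </p></div>
    <div>text1</div>
    <div>text2</div>
  </div>
</div>"""

# html.parser (recent bs4) ignores the unmatched </p>; other builders
# such as lxml or html5lib may repair the markup differently.
soup = BeautifulSoup(html, "html.parser")
result = soup.find("div", {"class": "result"})
texts = [d.get_text(strip=True) for d in result.find_all("div")]
```

When results differ across machines or environments, printing which builder is in use (`soup.builder`) and pinning one explicitly is usually the fix.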

How to accelerate Webscraping using the combination of Request and BeautifulSoup in Python?

Submitted by 末鹿安然 on 2021-01-29 06:12:00
Question: The objective is to scrape multiple pages using BeautifulSoup, with input from the requests.get module. The steps are: first, load the HTML using requests:

    page = requests.get('https://oatd.org/oatd/' + url_to_pass)

Then scrape the HTML content using the function below:

    def get_each_page(page_soup):
        return dict(paper_author=page_soup.find(attrs={"itemprop": "name"}).text,
                    paper_title=page_soup.find(attrs={"itemprop": "name"}).text)

Say we have a hundred unique URLs to be scraped [
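Since each request spends most of its time waiting on the network, a thread pool is the usual way to overlap them. A sketch of the pattern; `fetch` is injected (e.g. `lambda u: requests.get(u).text`) so it runs without a live site. Note also that `get_each_page` as quoted looks up the same first `itemprop="name"` element for both author and title; the two fields need different lookups.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, fetch, max_workers=10):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

# usage sketch (network):
# pages = scrape_all(full_urls, lambda u: requests.get(u).text)
```

Reusing a single `requests.Session()` inside `fetch` saves a TCP handshake per request; keep `max_workers` modest to stay polite to the server.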

Scrape table with BeautifulSoup

Submitted by 一世执手 on 2021-01-29 04:20:49
Question: I have a table structure that looks like this:

    <tr><td> <td>
    <td bgcolor="#E6E6E6" valign="top" align="left">testtestestes</td>
    </tr>
    <tr nowrap="nowrap" valign="top" align="left">
    <td nowrap="nowrap">8-K</td>
    <td class="small">Current report, items 1.01, 3.02, and 9.01
    <br>Accession Number: 0001283140-16-000129 Act: 34 Size: 520 KB </td>
    <td nowrap="nowrap">2016-09-19<br>17:30:01</td>
    <td nowrap="nowrap">2016-09-19</td><td align="left" nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action
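The standard approach for a table like this is to walk each `<tr>` and collect the text of its `<td>` cells. A sketch using a trimmed copy of the second row from the question:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr nowrap="nowrap" valign="top" align="left">
<td nowrap="nowrap">8-K</td>
<td class="small">Current report, items 1.01, 3.02, and 9.01</td>
<td nowrap="nowrap">2016-09-19</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # get_text(" ", strip=True) joins text around <br> tags with a space
    cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
    if cells:  # skip header/empty rows
        rows.append(cells)
```

Each entry in `rows` is then one list of cell strings, ready to write out as CSV.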

Scraping part of a Wikipedia Infobox

Submitted by 百般思念 on 2021-01-29 03:49:42
Question: I'm using Python 2.7, requests, and BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe holding partial URLs that correspond to the name of each song (these have been verified previously, and I get response code 200 when testing all of them). My code loops through and appends these individual URLs to the main Wikipedia URL. I've been able to get the heading of the page and other data, but what I really want is the Length of the song only
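Wikipedia infoboxes are tables whose rows pair a `<th>` label with a `<td>` value, so the length can be pulled by locating the "Length" header and reading its sibling cell. A sketch against a hypothetical fragment mirroring that structure (the real pages carry more attributes and nesting):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on a song-infobox row
html = """<table class="infobox"><tbody>
<tr><th scope="row">Length</th><td>3:45</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
th = soup.find("th", string="Length")
# The value cell sits next to the header cell in the same row
length = th.find_next_sibling("td").get_text(strip=True) if th else None
```

On real pages the cell sometimes holds multiple durations (album vs. single); `get_text(" ", strip=True)` keeps them readable, and a missing row simply yields `None`.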


Python 64 bit not storing as long of string as 32 bit python

Submitted by ≯℡__Kan透↙ on 2021-01-29 03:11:43
Question: I have two computers, both running 64-bit Windows 7. One machine has 32-bit Python, the other 64-bit Python. Both machines have 8 GB of RAM. I'm using BeautifulSoup to scrape a webpage, but I've been running into issues on my 64-bit Python machine. I've been able to figure out that the output of

    len(str(BeautifulSoup(requests.get("http://www.sampleurl.com").text)))

in 64-bit returns only 92520 characters, but on the same, static, site my 32-bit Python machine returns 135000
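Python's bitness does not limit string length at these sizes, so the first step is to localize where the characters disappear: compare the raw response length with the length of BeautifulSoup's re-serialization. If the raw lengths already differ between the two machines, the cause is in the HTTP layer (compression, user agent, redirects, differing library versions), not in BeautifulSoup. A small sketch of that check:

```python
from bs4 import BeautifulSoup

def raw_vs_parsed(html):
    """Return (raw length, length after a BeautifulSoup round-trip)."""
    parsed = str(BeautifulSoup(html, "html.parser"))
    return len(html), len(parsed)

# usage sketch (network):
# raw_len, parsed_len = raw_vs_parsed(requests.get("http://www.sampleurl.com").text)
```

A large raw/parsed gap on one machine instead points at the tree builder: the two machines may be picking different default parsers (lxml when installed, html.parser otherwise), which recover differently from malformed markup.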

Python: Parse from list only prints last item, not all?

Submitted by 我怕爱的太早我们不能终老 on 2021-01-29 02:22:50
Question: My code:

    from urllib2 import urlopen
    from bs4 import BeautifulSoup

    url = "https://realpython.com/practice/profiles.html"
    html_page = urlopen(url)
    html_text = html_page.read()
    soup = BeautifulSoup(html_text)
    links = soup.find_all('a', href=True)
    files = []
    base = "https://realpython.com/practice/"

    def page_names():
        for a in links:
            files.append(base + a['href'])

    page_names()

    for i in files:
        all_page = urlopen(i)
        all_text = all_page.read()
        all_soup = BeautifulSoup(all_text)
    print all_soup

The
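The usual cause of "only the last item prints" is that the work that should happen per iteration sits after the loop, so it only ever sees the variable's final value. A minimal sketch of the fix; `fetch` stands in for `urlopen(url).read()` so the pattern runs without a network:

```python
def collect_pages(urls, fetch):
    """Process every url, not just the last one."""
    pages = []
    for url in urls:
        # inside the loop body: runs once per url
        pages.append(fetch(url))
    return pages
```

In the quoted code, moving `print all_soup` (or an `append`) into the `for i in files:` body makes each page appear instead of only the final one.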
