beautifulsoup

Bs4 select_one vs find

偶尔善良 submitted on 2019-12-23 10:15:33
Question: I was wondering what the difference is between bs.find('div') and bs.select_one('div'). The same goes for find_all and select. Is there any difference performance-wise, or is one better to use than the other in specific cases?

Answer 1: select() and select_one() give you a different way of navigating an HTML tree, using CSS selectors, which have a rich and convenient syntax. That said, CSS selector support in BeautifulSoup is limited, but it covers the most common cases.
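As a rough illustration of the comparison above, the sketch below runs both APIs against the same markup. The sample HTML and all variable names are made up for this example, and it assumes the bs4 package with the built-in html.parser backend. It only shows that the two styles can reach the same elements; it says nothing about performance.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = """
<div id="first" class="box">one</div>
<div id="second" class="box highlight">two</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() and select_one() both return the first matching element (or None).
first_by_find = soup.find("div")
first_by_select = soup.select_one("div")

# find_all() and select() both return every matching element, in document order.
all_by_find = soup.find_all("div")
all_by_select = soup.select("div")

# A CSS selector can combine several conditions in a single string.
highlight_by_select = soup.select_one("div.box.highlight")
# A comparable lookup with find() uses keyword filters instead.
highlight_by_find = soup.find("div", class_="highlight")

print(first_by_find["id"], first_by_select["id"])
print([tag["id"] for tag in all_by_find], [tag["id"] for tag in all_by_select])
print(highlight_by_select["id"], highlight_by_find["id"])
```

For a plain tag name either style works; CSS selectors tend to pay off when the condition spans several classes or levels of nesting.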

BeautifulSoup can't find class that exists on webpage?

你离开我真会死。 submitted on 2019-12-23 09:27:00
Question: So I am trying to scrape the following webpage, https://www.scoreboard.com/uk/football/england/premier-league/, specifically the scheduled and finished results. Thus I am trying to look for the elements with class="stage-finished" or "stage-scheduled". However, when I scrape the webpage and print out what page_soup contains, it doesn't contain these elements. I found another SO question with an answer saying that this is because it is loaded via AJAX and I need to look at the XHR under the

beautifulsoup .get_text() is not specific enough for my HTML parsing

狂风中的少年 submitted on 2019-12-23 09:16:25
Question: Given the HTML code below, I want to output just the text of the h1 but not the "Details about ", which is the text of the span (which is enclosed by the h1). My current output gives: Details about New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black. I would like: New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black. Here is the HTML I am working with: <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  </span>New Men's
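One possible approach to the question above, sketched under assumptions: keep only the text nodes that are direct children of the h1, so the nested span's text is skipped. The HTML string below is rebuilt from the excerpt and the desired output quoted in the question; the closing </h1> tag and the variable names are assumptions, and this is not necessarily what the original answers proposed.

```python
from bs4 import BeautifulSoup, NavigableString

# Reconstructed for illustration; the closing </h1> is an assumption.
html = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about  </span>'
    "New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", id="itemTitle")

# get_text() walks every descendant, so the span's text is included.
full_text = h1.get_text()

# Keep only strings that are direct children of the h1 (skips the span).
own_text = "".join(
    child for child in h1.children if isinstance(child, NavigableString)
).strip()

print(own_text)
```

Unlike removing the span from the tree, this leaves the soup object unchanged.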

BeautifulSoup - findAll not within certain tag

若如初见. submitted on 2019-12-23 08:52:32
Question: So I'm trying to find a way to find all items within a BeautifulSoup object that have a certain tag and that aren't within a certain other tag. For example: <td class="disabled first"> <div class="dayContainer"> <p class="day"> 29 </p> <p class="moreLink"> </p> </div> </td> I want to find all occurrences of class="dayContainer", which is simple enough, but how do I go about finding all of those that aren't first within class="disabled"?

Answer 1: Run a filter for tags whose .parent doesn't have that
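The answer excerpt is cut off, but the idea of filtering on .parent can be sketched as follows. The second table cell with class "enabled", the helper name parent_is_disabled, and the surrounding table markup are assumptions added for this example.

```python
from bs4 import BeautifulSoup

# The first cell mirrors the question; the second cell is a made-up counterexample.
html = """
<table><tr>
<td class="disabled first"><div class="dayContainer"><p class="day">29</p></div></td>
<td class="enabled"><div class="dayContainer"><p class="day">30</p></div></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")


def parent_is_disabled(tag):
    """Return True if the tag's direct parent carries the 'disabled' class."""
    # bs4 exposes the class attribute as a list of class names.
    return "disabled" in tag.parent.get("class", [])


containers = soup.find_all("div", class_="dayContainer")
kept = [div for div in containers if not parent_is_disabled(div)]
days = [div.find("p", class_="day").get_text(strip=True) for div in kept]

print(days)
```

This checks only the direct parent; if the disabled element could sit further up the tree, a check based on find_parent() would be the analogous sketch.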

Scrape using Beautiful Soup preserving &nbsp; entities

人盡茶涼 submitted on 2019-12-23 07:48:53
Question: I would like to scrape a table from the web and keep the &nbsp; entities intact so that I can republish it as HTML later. BeautifulSoup seems to be converting these to spaces, though. Example: from bs4 import BeautifulSoup html = "<html><body><table><tr>" html += "<td>&nbsp;hello&nbsp;</td>" html += "</tr></table></body></html>" soup = BeautifulSoup(html) table = soup.find_all('table')[0] row = table.find_all('tr')[0] cell = row.find_all('td')[0] print cell observed result: <td> hello </td> required result: <td
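One way this might be handled, offered as a sketch rather than the answer given on the original page: bs4 parses &nbsp; into the non-breaking space character U+00A0, and the "html" output formatter converts such characters back to named entities when the tag is serialized. The example assumes bs4 with the html.parser backend and Python 3; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr><td>&nbsp;hello&nbsp;</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Default serialization keeps the parsed U+00A0 characters as-is,
# which look like ordinary spaces when printed.
default_markup = str(cell)

# The "html" formatter substitutes named entities where one exists.
entity_markup = cell.decode(formatter="html")

print(repr(default_markup))
print(entity_markup)
```

The same formatter argument is accepted by prettify(), so a whole document can be written back out with entities in the same way.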

Find all tables in html using BeautifulSoup

时光毁灭记忆、已成空白 submitted on 2019-12-23 07:46:33
Question: I want to find all tables in HTML using BeautifulSoup. Inner tables should be included in outer tables. I have created some code which works and gives the expected output. But I don't like this solution, because it destroys the 'soup' object. Do you know how to do it in a more elegant way? from BeautifulSoup import BeautifulSoup as bs input = '''<html><head><title>title</title></head> <body> <p>paragraph</p> <div><div> <table>table1<table>inner11<table>inner12</table></table></table> <div><table
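A non-destructive variant could look like the sketch below. It is written against bs4 rather than the BeautifulSoup 3 import used in the question, and the sample markup, the id attributes, and the variable names are assumptions for illustration. find_all("table") returns nested tables as well, and serializing an outer table includes its inner tables without modifying the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one table nested inside another.
html = """
<div>
<table id="t1"><tr><td>outer
  <table id="t2"><tr><td>inner</td></tr></table>
</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() searches all descendants, so nested tables are returned too.
tables = soup.find_all("table")
table_ids = [table["id"] for table in tables]

# The outer table's markup already contains the inner table.
outer_markup = str(tables[0])

# Tables that are not nested inside any other table.
top_level = [table for table in tables if table.find_parent("table") is None]
top_level_ids = [table["id"] for table in top_level]

print(table_ids, top_level_ids)
```

Nothing is extracted or decomposed here, so the soup object can be reused afterwards.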

Not able to find the data from xpath

纵饮孤独 submitted on 2019-12-23 06:14:43
Question: I tried to extract the data every minute and write it to a CSV file, but I couldn't do it, since I am new to this broad data science world. I tried find_all with the soup library, but it does not show the data. import requests from bs4 import BeautifulSoup page = requests.get('https://finviz.com/forex_performance.ashx') soup = BeautifulSoup(page.content, 'html.parser') forex = soup.find_all("div", {"class": "content "}) print(forex) I would like to get the data in the following format: name of the

Reading 1000s of XML documents with BeautifulSoup

流过昼夜 submitted on 2019-12-23 05:45:10
Question: I'm trying to read a bunch of XML files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file. You can see a sample of the data here (warning: this will initiate a download of a 108MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code: from __future__ import print

BeautifulSoup html missing

我只是一个虾纸丫 submitted on 2019-12-23 05:43:11
Question: I'm trying to get the URL for the link to download historical data from Yahoo Finance for an asset during a specific timeframe: January 1, 1999 to the present day. So, for example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d I would want to acquire this (from the "Download Data" link above the table of data): "https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200

BeautifulSoup html missing

ぐ巨炮叔叔 submitted on 2019-12-23 05:43:06
Question: I'm trying to get the URL for the link to download historical data from Yahoo Finance for an asset during a specific timeframe: January 1, 1999 to the present day. So, for example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d I would want to acquire this (from the "Download Data" link above the table of data): "https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200