beautifulsoup | 易学教程

Bs4 select_one vs find

Question: I was wondering what the difference is between bs.find('div') and bs.select_one('div'). The same goes for find_all and select. Is there any difference performance-wise, or is one better to use than the other in specific cases?

Answer 1: select() and select_one() give you a different way of navigating an HTML tree, using CSS selectors, which have a rich and convenient syntax. The CSS selector support in BeautifulSoup is limited, though, but it covers most common cases.
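To make the comparison concrete, here is a minimal sketch that runs both styles against the same markup. The HTML snippet and the variable names are made up for illustration and are not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to compare the two query styles.
html = """
<div id="main">
  <p class="intro">first</p>
  <div class="box"><p class="note">second</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() and select_one() both return the first match, or None if nothing matches.
first_div_find = soup.find("div")
first_div_css = soup.select_one("div")

# find_all() and select() both return every match in document order.
all_p_find = soup.find_all("p")
all_p_css = soup.select("p")

# A CSS selector can describe nesting in a single string...
nested_css = soup.select_one("div.box > p.note")
# ...whereas the find() version chains calls to express the same relationship.
nested_find = soup.find("div", class_="box").find("p", class_="note", recursive=False)

print(first_div_find.get("id"), first_div_css.get("id"))
print([p.get_text() for p in all_p_css])
print(nested_css.get_text(), nested_find.get_text())
```

For a plain tag lookup the two styles return the same elements, so the choice is mostly about readability: selectors tend to be shorter for nested or combined conditions, while find() and find_all() also accept regular expressions and filter functions.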

BeautifulSoup can't find class that exists on webpage?

Question: I am trying to scrape the following webpage, https://www.scoreboard.com/uk/football/england/premier-league/, specifically the scheduled and finished results, so I am looking for the elements with class "stage-finished" or "stage-scheduled". However, when I scrape the webpage and print out what page_soup contains, it doesn't contain these elements. I found another SO question with an answer saying that this is because the content is loaded via AJAX and I need to look at the XHR under the

beautifulsoup .get_text() is not specific enough for my HTML parsing

Question: Given the HTML code below, I want to output just the text of the h1 but not the "Details about ", which is the text of the span (which is enclosed by the h1). My current output gives: Details about New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black. I would like: New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black. Here is the HTML I am working with: <h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about </span>New Men's
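One possible approach to the question above is to collect only the strings that are direct children of the h1, which skips the text inside the nested span. This is a sketch, not the accepted answer: the markup in the excerpt is cut off, so the rest of the h1 below is an assumption completed from the desired output quoted in the question.

```python
from bs4 import BeautifulSoup

# Assumption: the h1 is completed from the "I would like" text in the question,
# because the HTML in the excerpt is truncated.
html = (
    '<h1 class="it-ttl" itemprop="name" id="itemTitle">'
    '<span class="g-hdn">Details about </span>'
    "New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black"
    "</h1>"
)
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", id="itemTitle")

# string=True matches text nodes; recursive=False restricts the search to
# direct children of the h1, so the span's text is not included.
own_text = "".join(h1.find_all(string=True, recursive=False)).strip()
print(own_text)
```

An alternative is to call h1.span.extract() before get_text(), but that modifies the parsed tree, whereas the version above leaves it untouched.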

BeautifulSoup - findAll not within certain tag

Question: I'm trying to find all items within a BeautifulSoup object that have a certain tag but aren't within a certain other tag. For example: <td class="disabled first"> <div class="dayContainer"> <p class="day"> 29 </p> <p class="moreLink"> </p> </div> </td> I want to find all instances of class="dayContainer", which is simple enough, but how do I go about finding all of those that aren't within class="disabled"?

Answer 1: Run a filter for tags whose .parent doesn't have that
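The filter idea in the answer can be sketched with a function passed to find_all. The second table cell and the helper name not_in_disabled_cell are hypothetical additions for illustration, and the check only looks at the immediate parent, as the answer suggests.

```python
from bs4 import BeautifulSoup

# The first cell mirrors the question; the second cell is a made-up counterexample.
html = """
<table><tr>
<td class="disabled first"><div class="dayContainer"><p class="day">29</p></div></td>
<td class="first"><div class="dayContainer"><p class="day">30</p></div></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

def not_in_disabled_cell(tag):
    """Match div.dayContainer tags whose direct parent lacks the 'disabled' class."""
    classes = tag.get("class", [])
    parent_classes = tag.parent.get("class", []) if tag.parent else []
    return (
        tag.name == "div"
        and "dayContainer" in classes
        and "disabled" not in parent_classes
    )

containers = soup.find_all(not_in_disabled_cell)
days = [c.p.get_text(strip=True) for c in containers]
print(days)
```

If the disabled element can be a more distant ancestor rather than the direct parent, the condition could use tag.find_parent(class_="disabled") is None instead.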

Scrape using Beautiful Soup preserving entities

Question: I would like to scrape a table from the web and keep the entities intact so that I can republish it as HTML later. BeautifulSoup seems to be converting these to spaces, though. Example: from bs4 import BeautifulSoup html = "<html><body><table><tr>" html += "<td> hello </td>" html += "</tr></table></body></html>" soup = BeautifulSoup(html) table = soup.find_all('table')[0] row = table.find_all('tr')[0] cell = row.find_all('td')[0] print(cell) Observed result: <td> hello </td> Required result: <td
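A hedged sketch of one way to get entities back on output: Beautiful Soup's "html" output formatter writes characters as named HTML entities where a name exists. The excerpt has lost the original entity, so the &nbsp; used below is an assumption about what the question contained.

```python
from bs4 import BeautifulSoup

# Assumption: the entity in the original question was &nbsp;.
html = "<html><body><table><tr><td>&nbsp;hello&nbsp;</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Default output: the parser has already turned &nbsp; into U+00A0 characters,
# which look like ordinary spaces when printed.
default_markup = str(cell)

# The "html" formatter converts such characters back to named entities on output.
entity_markup = cell.decode(formatter="html")

print(repr(default_markup))
print(entity_markup)
```

Note that the parsed tree itself still holds Unicode characters; the entities are only restored when the markup is serialized with this formatter.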

Find all tables in html using BeautifulSoup

Question: I want to find all tables in HTML using BeautifulSoup. Inner tables should be included in outer tables. I have written some code that works and gives the expected output, but I don't like this solution because it destroys the 'soup' object. Do you know how to do it in a more elegant way? from BeautifulSoup import BeautifulSoup as bs input = '''<html><head><title>title</title></head> <body> <p>paragraph</p> <div><div> <table>table1<table>inner11<table>inner12</table></table></table> <div><table

Not able to find the data from xpath

Question: I tried to extract the data every minute and write it to a CSV file, but I couldn't do it, since I am new to the broad world of data science. I tried find_all with BeautifulSoup, but it does not show the data. import requests from bs4 import BeautifulSoup page = requests.get('https://finviz.com/forex_performance.ashx') soup = BeautifulSoup(page.content, 'html.parser') forex = soup.find_all("div", {"class": "content "}) print(forex) I would like to get the data in the following format: name of the

Reading 1000s of XML documents with BeautifulSoup

Question: I'm trying to read a bunch of XML files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file. You can see a sample of the data here (warning: this will initiate a download of a 108MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code: from __future__ import print

BeautifulSoup html missing

Question: I'm trying to get the URL of the link to download historical data from Yahoo Finance for an asset during a specific timeframe: January 1, 1999 to the present day. For example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d I would want to acquire this (from the "Download Data" link above the table of data): "https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200

BeautifulSoup html missing
