beautifulsoup | 易学教程

How to fetch data from a JavaScript-loaded site using BeautifulSoup

Question: I am trying to fetch some data from this website: https://www.walmart.com/store/2141-philadelphia-pa/search?query=ice%20cream

I have been using this method to fetch JavaScript-loaded sites:

    def getLocalStoreProducts():
        session = requests.Session()
        localStoreUrl = 'https://www.walmart.com/store/2141-philadelphia-pa/search?query='
        searchWord = "ice cream"
        searchWord1 = checkForSpace(searchWord)
        wordUrl = localStoreUrl+searchWord1
        print(wordUrl)
        # try:
        categorySoup = BeautifulSoup(session.get
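The `checkForSpace` helper is not shown in the excerpt. Assuming it only percent-encodes the search term, a minimal stand-in built on the standard library's `urllib.parse.quote` might look like the sketch below. `build_search_url` is a hypothetical name. This covers only the URL-building step and does not by itself make JavaScript-rendered content appear in the response.

```python
from urllib.parse import quote

# Base URL copied from the question.
LOCAL_STORE_URL = "https://www.walmart.com/store/2141-philadelphia-pa/search?query="


def build_search_url(search_word, base_url=LOCAL_STORE_URL):
    """Hypothetical stand-in for checkForSpace: percent-encode the search term."""
    # quote() encodes the space in "ice cream" as %20.
    return base_url + quote(search_word)


print(build_search_url("ice cream"))
```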

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

Question: I am trying to get a JSON response from the link passed as a parameter to the urllib request, but it gives me an error saying the URL can't contain control characters. How can I solve the issue?

    start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project
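The URL in the excerpt contains literal spaces (for example in `bundle_fq =procurement_notice`), and recent Python versions reject whitespace and control characters in a request URL with this error. One possible fix, assuming the spaces are accidental (for instance left over from wrapping the long string), is to strip all whitespace before opening the URL. `remove_whitespace` is a hypothetical helper, and the URL below is a shortened, illustrative version of the one in the question.

```python
def remove_whitespace(url):
    """Drop every whitespace character (spaces, tabs, newlines) from a URL."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(url.split())


# Shortened, illustrative version of the URL from the question.
raw_url = (
    "https://devbusiness.un.org/solr-sitesearch-output/10//0/"
    "ds_field_last_updated/desc?bundle_fq =procurement_notice"
    "&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English"
)
clean_url = remove_whitespace(raw_url)
print(clean_url)
# The cleaned string can then be passed to urllib.request.urlopen(clean_url).
```

If a space is meant to be part of a parameter value, percent-encoding that value with `urllib.parse.quote` would be the alternative to removing it.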

Extract link from url using Beautifulsoup

Question: I am trying to get the web link from the following markup using BeautifulSoup:

    <div class="alignright single">
    <a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> »
    </div>
    </div>

My code is as follows:

    from bs4 import BeautifulSoup
    import urllib2
    url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should
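As a rough illustration of one way to pull the link out of markup like this, the sketch below finds the `div` by one of its classes and then reads the `href` of the first anchor inside it. It assumes the `bs4` package is installed and uses the built-in `html.parser`; `next_link` is a hypothetical helper name, and the markup is pasted in as a string rather than downloaded.

```python
from bs4 import BeautifulSoup

# Markup copied from the question, embedded as a string for the example.
html = """
<div class="alignright single">
  <a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> »
</div>
"""


def next_link(markup):
    """Return the href of the first link inside div.alignright, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # class_ matches when "alignright" is any one of the element's classes.
    container = soup.find("div", class_="alignright")
    if container is None:
        return None
    anchor = container.find("a", href=True)
    return anchor["href"] if anchor else None


print(next_link(html))
```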

Loop pages and save contents in Excel file from website in Python

Question: I'm trying to loop through the pages from this link and extract the interesting part. Please see the contents in the red circle in the image below. Here's what I've tried:

    url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
    for page in range(10):
        r = requests.get(url.format(page))
        soup = BeautifulSoup(r.content, "html.parser")
        print(soup)

XPath for each element (might be helpful for those who don't read Chinese):

    /html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】
    /html/body/div[3]

Multiple conditions in BeautifulSoup: Text=True & IMG Alt=True

Question: Is there a way to use multiple conditions in BeautifulSoup? These are the two conditions I would like to use together:

    Get text: soup.find_all(text=True)
    Get img alt: soup.find_all('img', title=True)

I know I can do it separately, but I would like to get them together to keep the flow of the HTML. The reason I'm doing this is that only BeautifulSoup extracts text hidden by CSS (display: none). When you use driver.find_element_by_tag_name('body').text you get the img alt attribute, but unfortunately not
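One way to keep text and image `alt` values in document order is to walk `soup.descendants` once and handle both node types in the same loop. This is a minimal sketch, assuming the `bs4` package is installed. `text_and_alts` and the sample markup are hypothetical, and the sketch reads the `alt` attribute named in the title rather than `title`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def text_and_alts(markup):
    """Collect non-empty text nodes and img alt values in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    parts = []
    for node in soup.descendants:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:  # skip whitespace-only strings
                parts.append(text)
        elif isinstance(node, Tag) and node.name == "img" and node.get("alt"):
            parts.append(node["alt"])
    return parts


# Hypothetical sample: visible text, an image, and a CSS-hidden span.
sample = (
    '<p>Hello <img alt="cat" src="c.png"> world</p>'
    '<span style="display:none">hidden</span>'
)
print(text_and_alts(sample))
```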

Discord does not embed link when sent by my bot

Question: My code works fine and the bot sends the link, but Discord does not recognize it as one and does not embed it. When I copy and paste it myself, Discord recognizes it as a link and embeds the image. Here is my code:

    import requests
    from bs4 import BeautifulSoup

    if message.content.startswith(".dog"):
        response = requests.get("https://dog.ceo/api/breeds/image/random")
        soupRaw = BeautifulSoup(response.text, 'lxml')
        soupBackend = str(soupRaw).split("message")
        soup2 = soupBackend[1]
        soup3 = soup2[3:]
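The endpoint in the question returns JSON rather than HTML, so parsing it as JSON is likely simpler than slicing the string form of a BeautifulSoup object. The sketch below assumes the body has a `message` field holding the image URL with JSON-escaped slashes (`\/`); that payload shape is an assumption, not something shown in the excerpt. If it holds, string slicing would leave the backslashes in the link, which may be why Discord does not treat it as a URL. `sample_body` and `extract_image_url` are hypothetical.

```python
import json

# Hypothetical response body; the real payload may differ.
sample_body = (
    r'{"message": "https:\/\/images.dog.ceo\/breeds\/example\/dog.jpg", '
    r'"status": "success"}'
)


def extract_image_url(body):
    """Decode the JSON body and return the value of its "message" field."""
    # json.loads() converts the escaped "\/" sequences back to plain "/".
    return json.loads(body)["message"]


print(extract_image_url(sample_body))
```

With `requests`, calling `response.json()["message"]` performs the same decoding on the live response.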

Get value of attribute using CSS Selectors with BeautifulSoup

Question: I am web scraping with Python using the BeautifulSoup library. I have HTML markup like this:

    <tr class="deals" data-url="www.example2.com">
      <span class="hotel-name">
        <a href="www.example2.com"></a>
      </span>
    </tr>
    <tr class="deals" data-url="www.example3.com">
      <span class="hotel-name">
        <a href="www.example3.com"></a>
      </span>
    </tr>

I want to get the data-url or the href value in all <tr>s, preferably the href value. Here is a little snippet of my relevant code:

    main_url = "http://localhost
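A minimal sketch of reading both attributes with CSS selectors, assuming the `bs4` package (with its bundled selector support) is installed and the markup is available as a string. The variable names are illustrative. The built-in `html.parser` is used here because other parsers may restructure table rows that appear outside a `<table>`.

```python
from bs4 import BeautifulSoup

# Markup copied from the question, embedded as a string for the example.
html = """
<tr class="deals" data-url="www.example2.com">
  <span class="hotel-name"><a href="www.example2.com"></a></span>
</tr>
<tr class="deals" data-url="www.example3.com">
  <span class="hotel-name"><a href="www.example3.com"></a></span>
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute on the row itself: select rows that carry data-url.
data_urls = [row["data-url"] for row in soup.select("tr.deals[data-url]")]

# Attribute on the nested link: select anchors that carry href.
hrefs = [a["href"] for a in soup.select("tr.deals span.hotel-name a[href]")]

print(data_urls)
print(hrefs)
```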

Maintaining the indentation of an XML file when parsed with Beautifulsoup

Question: I am using BS4 to parse an XML file and trying to write it back to a new XML file.

Input file:

    <tag1> <tag2 attr1="a1"> example text </tag2> <tag3> <tag4 attr2="a2"> example text </tag4> <tag5> <tag6 attr3="a3"> example text </tag6> </tag5> </tag3> </tag1>

Script:

    soup = BeautifulSoup(open("input.xml"), "xml")
    f = open("output.xml", "w")
    f.write(soup.encode(formatter='minimal'))
    f.close()

Output:

    <tag1> <tag2 attr1="a1"> example text </tag2> <tag3> <tag4 attr2="a2"> example text </tag4> <tag5

Beautiful Soup returns None on existing element

Question: I'm trying to scrape the price of a product. Here's my code:

    from bs4 import BeautifulSoup as soup
    import requests

    page_url = "https://www.falabella.com/falabella-cl/product/5311682/Smartphone-iPhone-7-PLUS-32GB/5311682/"
    headers={ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36' }
    uClient = requests.get(page_url, headers=headers)
    print(uClient)
    page_soup = soup(uClient.content, "html.parser") #requests