beautifulsoup

Python - find a substring between two strings based on the last occurrence of the latter string

Submitted by 半城伤御伤魂 on 2021-02-08 12:11:44
Question: I am trying to find a substring which is between two strings. The first string is <br> and the last string is <br><br>. The first string I look for is repetitive, while the latter string can serve as an anchor. Here is an example:

<div class="linkTabBl" style="float:left;padding-top:6px;width:240px"> Anglo American plc <br> 20 Carlton House Terrace <br> SW1Y 5AN London <br> United Kingdom <br><br> Phone : +44 (0)20 7968 8888 <br> Fax : +44 (0)20 7968 8500 <br> Internet : <a class="pageprofil
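
One way to sketch this anchoring idea, assuming the markup is already held in a plain string (the string below is abridged from the example): split on the <br><br> anchor first, then take the last <br>-separated segment before it.

```python
# A minimal sketch of the "anchor on <br><br>" approach, assuming the
# target markup is available as a plain string (abridged here).
html = ('Anglo American plc <br> 20 Carlton House Terrace <br> '
        'SW1Y 5AN London <br> United Kingdom <br><br> Phone : +44 (0)20 7968 8888')

# Everything before the <br><br> anchor ...
before_anchor = html.split('<br><br>')[0]
# ... then the segment after the last repetitive <br> separator.
result = before_anchor.split('<br>')[-1].strip()
print(result)  # United Kingdom
```

The same two-step split generalizes: the anchor isolates the region of interest, and only then does the repetitive delimiter get applied.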

How to print paragraphs and headings simultaneously while scraping in Python?

Submitted by 坚强是说给别人听的谎言 on 2021-02-08 11:52:06
Question: I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

str = ''  # my_url
source = urllib.request.urlopen(str)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
for paragraph in match.find_all('p'):
    str += paragraph.text + "\n"

My tag structure:

<article class="xyz">
<h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p>
</article>

I am getting output like this (as I am able to extract the
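
Since find_all('p') skips the headings entirely, one fix is to pass find_all a list of tag names, which returns matches in document order so each heading is followed by its paragraph. A minimal sketch against the tag structure from the question (requires beautifulsoup4):

```python
from bs4 import BeautifulSoup

html = ('<article class="xyz"><h4>dr</h4><p>efkl</p>'
        '<h4>dr</h4><p>efkl</p></article>')
soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article', class_='xyz')

# find_all with a list of names preserves document order,
# so headings and paragraphs come out interleaved.
lines = [tag.get_text() for tag in article.find_all(['h4', 'p'])]
text = '\n'.join(lines)
print(text)
```

Looping over the combined result is what keeps the heading/paragraph pairing; two separate find_all calls would lose it.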

How can I make sure that I am on the About Us page of a particular website

Submitted by 笑着哭i on 2021-02-08 11:50:10
Question: Here's a snippet of code which I am trying to use to retrieve all the links from a website, given the URL of a homepage.

import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
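
Scanning for quote characters by hand is fragile; modern BeautifulSoup (the `from BeautifulSoup import BeautifulSoup` form in the snippet is the old BS3 package) can collect every href directly, and an "About us" check then becomes a simple filter. A hedged sketch with illustrative markup; the keyword match on the URL is a heuristic, not a guarantee:

```python
from bs4 import BeautifulSoup  # beautifulsoup4, the Python 3 successor to BS3

# Illustrative anchors standing in for a real homepage's links.
html = ('<a href="https://www.udacity.com/us/about">About us</a>'
        '<a href="https://www.udacity.com/jobs">Jobs</a>')
soup = BeautifulSoup(html, 'html.parser')

# href=True skips anchors that have no href attribute at all.
links = [a['href'] for a in soup.find_all('a', href=True)]

# Heuristic: an "About us" page usually has "about" in its URL path.
about_links = [u for u in links if 'about' in u.lower()]
print(about_links)
```

For more certainty than a URL heuristic gives, the page's own content (title tag, headings) can be checked after fetching the candidate link.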

Not able to scrape images from the Flipkart.com website; the src attribute is coming up empty

Submitted by 好久不见. on 2021-02-08 11:18:21
Question: I am able to scrape all the data from the Flipkart website except the images, using the code below:

jobs = soup.find_all('div', {"class": "IIdQZO _1R0K0g _1SSAGr"})
for job in jobs:
    product_name = job.find('a', {'class': '_2mylT6'})
    product_name = product_name.text if product_name else "N/A"
    product_offer_price = job.find('div', {'class': '_1vC4OE'})
    product_offer_price = product_offer_price.text if product_offer_price else "N/A"
    product_mrp = job.find('div', {'class': '_3auQ3N'})
    product_mrp = product_mrp
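
An empty src usually points to lazy-loaded images: the real URL sits in a fallback attribute such as data-src until the site's JavaScript swaps it in, so a plain HTML fetch never sees it in src. A sketch of the fallback lookup; the class and attribute names below are illustrative assumptions, not Flipkart's actual markup, so the page source should be inspected to find the attribute the site really uses:

```python
from bs4 import BeautifulSoup

# Illustrative markup: src stays empty until JS runs; the real URL
# lives in data-src (attribute name is an assumption).
html = '<img class="product-img" src="" data-src="https://img.example.com/p/shoe.jpg">'
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img')
# Fall through the common lazy-load attributes; an empty src is falsy.
src = img.get('src') or img.get('data-src') or img.get('data-original') or 'N/A'
print(src)
```

If none of the fallback attributes carry the URL, the image is likely injected entirely by script, and a browser-driven tool (e.g. Selenium) is the usual next step.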

Scraping wrong table

Submitted by 你。 on 2021-02-08 10:46:31
Question: I'm trying to get the advanced stats of players onto an Excel sheet, but the table it's scraping is the first one instead of the advanced stats table:

ValueError: Length of passed values is 23, index implies 21

If I try to use the id instead, I get another error about tbody. Also, I get an error about lname = name.split(" ")[1]:

IndexError: list index out of range

I think that has to do with 'Nene' in the list. Is there a way to fix that?

import requests
from bs4 import BeautifulSoup
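
Two separate problems seem to be in play. On sports-stats sites the later tables are often wrapped in HTML comments, so the parser only "sees" the first table; and an unconditional split(" ")[1] crashes on single-word names like 'Nene'. A hedged sketch of both fixes, with illustrative markup:

```python
from bs4 import BeautifulSoup, Comment

# Illustrative page: the advanced table is hidden inside an HTML comment.
html = ('<table id="basic"><tr><td>first</td></tr></table>'
        '<!-- <table id="advanced"><tr><td>adv</td></tr></table> -->')
soup = BeautifulSoup(html, 'html.parser')

# Re-parse each comment's text so tables inside it become findable.
advanced = None
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    advanced = BeautifulSoup(comment, 'html.parser').find('table', id='advanced')
    if advanced:
        break

def split_name(full_name):
    """Split into (first, last), guarding against one-word names like 'Nene'."""
    parts = full_name.split(' ')
    return parts[0], parts[1] if len(parts) > 1 else ''

print(advanced.td.get_text())  # adv
print(split_name('Nene'))      # ('Nene', '')
```

The Comment re-parse is the standard workaround for comment-hidden tables; the guarded split simply makes the missing surname empty instead of raising IndexError.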

How to scrape many dynamic urls in Python

Submitted by 狂风中的少年 on 2021-02-08 10:30:39
Question: I want to scrape one dynamic URL at a time. What I did is that I scrape the URLs I get from all the hrefs, and then I want to scrape each of those URLs. What I am trying:

from bs4 import BeautifulSoup
import urllib.request
import re

r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for

BeautifulSoup returns some weird text for the <a> tag

Submitted by ぃ、小莉子 on 2021-02-08 10:25:53
Question: I'm new to web scraping and I'm trying to scrape data from this auction website. However, I ran into this weird problem when trying to get the text of the anchor tag. Here's the HTML:

<div class="mt50">
<div class="head_011">
<a id="item_event_title" href="https://www.storyltd.com/auction/auction.aspx?eid=4158">NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)</a>
</div>
</div>

Here's my code:

auction_info = LTD_work_soup.find('a', id='item_event_title').text
print(auction
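
Parsed directly, the static HTML shown does yield the readable title, so the lookup itself is fine; if the live page still comes back garbled, the site may be serving different (encoded or script-rendered) content to non-browser clients, which is worth checking via the response's declared encoding and raw bytes. A sketch against the HTML from the question:

```python
from bs4 import BeautifulSoup

# The HTML from the question, parsed directly.
html = '''<div class="mt50"><div class="head_011">
<a id="item_event_title" href="https://www.storyltd.com/auction/auction.aspx?eid=4158">
NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)</a></div></div>'''
soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) trims the surrounding whitespace and newlines.
auction_info = soup.find('a', id='item_event_title').get_text(strip=True)
print(auction_info)
```

Comparing this output with what the live fetch produces narrows the problem down to either the HTTP response or the parsing step.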

Loop url from dataframe and download pdf files in Python

Submitted by ☆樱花仙子☆ on 2021-02-08 10:16:36
Question: Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here. Now I would like to go further and click the URL link: for each URL, I will need to open and save PDF-format files. How could I do that in Python? Any help would be greatly appreciated. Code for reference:

import shutil
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'xxx'
for page in range(6):
    r = requests
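
A sketch of the download step, assuming each crawled link points directly at a PDF (the URLs below are illustrative; a filename is derived from the URL path and the body is streamed to disk so large files are not held in memory):

```python
import os
from urllib.parse import urlparse

import requests  # third-party: pip install requests

def pdf_filename(url):
    """Derive a local filename from the URL path (illustrative helper)."""
    name = os.path.basename(urlparse(url).path)
    return name if name.lower().endswith('.pdf') else name + '.pdf'

def download_pdf(url, out_dir='pdfs'):
    """Stream one PDF to disk; assumes `url` points directly at a PDF."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, pdf_filename(url))
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return path

print(pdf_filename('https://example.com/files/transaction-01.pdf'))
```

Calling download_pdf for each URL read back out of the Excel file (or kept in the crawl loop's list) completes the pipeline.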
