beautifulsoup

Python - find a substring between two strings based on the last occurrence of the latter string

Submitted by 半城伤御伤魂 on 2021-02-08 12:11:44
Question: I am trying to find a substring which is between two strings. The first string is <br> and the last string is <br><br>. The first string I look for is repetitive, while the latter string can serve as an anchor. Here is an example:

<div class="linkTabBl" style="float:left;padding-top:6px;width:240px"> Anglo American plc <br> 20 Carlton House Terrace <br> SW1Y 5AN London <br> United Kingdom <br><br> Phone : +44 (0)20 7968 8888 <br> Fax : +44 (0)20 7968 8500 <br> Internet : <a class="pageprofil
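
One way to sketch this anchoring idea, assuming the markup is already held in a plain string (the string below is abridged from the example): split on the <br><br> anchor first, then take the last <br>-separated segment before it.

```python
# A minimal sketch of the "anchor on <br><br>" approach, assuming the
# target markup is available as a plain string (abridged here).
html = ('Anglo American plc <br> 20 Carlton House Terrace <br> '
        'SW1Y 5AN London <br> United Kingdom <br><br> Phone : +44 (0)20 7968 8888')

# Everything before the <br><br> anchor ...
before_anchor = html.split('<br><br>')[0]
# ... then the segment after the last repetitive <br> separator.
result = before_anchor.split('<br>')[-1].strip()
print(result)  # United Kingdom
```

The same two-step split generalizes: the anchor isolates the region of interest, and only then does the repetitive delimiter get applied.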

How to print paragraphs and headings simultaneously while scraping in Python?

Submitted by 坚强是说给别人听的谎言 on 2021-02-08 11:52:06
Question: I am a beginner in Python. I am currently using BeautifulSoup to scrape a website.

str = ''  # my_url
source = urllib.request.urlopen(str)
soup = bs.BeautifulSoup(source, 'lxml')
match = soup.find('article', class_='xyz')
for paragraph in match.find_all('p'):
    str += paragraph.text + "\n"

My tag structure:

<article class="xyz">
<h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p> <h4>dr</h4> <p>efkl</p>
</article>

I am getting output like this (as I am able to extract the
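
Since find_all('p') skips the headings entirely, one fix is to pass find_all a list of tag names, which returns matches in document order so each heading is followed by its paragraph. A minimal sketch against the tag structure from the question (requires beautifulsoup4):

```python
from bs4 import BeautifulSoup

html = ('<article class="xyz"><h4>dr</h4><p>efkl</p>'
        '<h4>dr</h4><p>efkl</p></article>')
soup = BeautifulSoup(html, 'html.parser')
article = soup.find('article', class_='xyz')

# find_all with a list of names preserves document order,
# so headings and paragraphs come out interleaved.
lines = [tag.get_text() for tag in article.find_all(['h4', 'p'])]
text = '\n'.join(lines)
print(text)
```

Looping over the combined result is what keeps the heading/paragraph pairing; two separate find_all calls would lose it.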

How can I make sure that I am on the About Us page of a particular website

Submitted by 笑着哭i on 2021-02-08 11:50:10
Question: Here's a snippet of code which I am trying to use to retrieve all the links from a website, given the URL of a homepage.

import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))

def getURL(page):
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
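
Scanning for quote characters by hand is fragile; modern BeautifulSoup (the `from BeautifulSoup import BeautifulSoup` form in the snippet is the old BS3 package) can collect every href directly, and an "About us" check then becomes a simple filter. A hedged sketch with illustrative markup; the keyword match on the URL is a heuristic, not a guarantee:

```python
from bs4 import BeautifulSoup  # beautifulsoup4, the Python 3 successor to BS3

# Illustrative anchors standing in for a real homepage's links.
html = ('<a href="https://www.udacity.com/us/about">About us</a>'
        '<a href="https://www.udacity.com/jobs">Jobs</a>')
soup = BeautifulSoup(html, 'html.parser')

# href=True skips anchors that have no href attribute at all.
links = [a['href'] for a in soup.find_all('a', href=True)]

# Heuristic: an "About us" page usually has "about" in its URL path.
about_links = [u for u in links if 'about' in u.lower()]
print(about_links)
```

For more certainty than a URL heuristic gives, the page's own content (title tag, headings) can be checked after fetching the candidate link.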

Not able to scrape images from the Flipkart.com website; the src attribute is coming up empty

Submitted by 好久不见. on 2021-02-08 11:18:21
Question: I am able to scrape all the data from the Flipkart website except the images, using the code below:

jobs = soup.find_all('div', {"class": "IIdQZO _1R0K0g _1SSAGr"})
for job in jobs:
    product_name = job.find('a', {'class': '_2mylT6'})
    product_name = product_name.text if product_name else "N/A"
    product_offer_price = job.find('div', {'class': '_1vC4OE'})
    product_offer_price = product_offer_price.text if product_offer_price else "N/A"
    product_mrp = job.find('div', {'class': '_3auQ3N'})
    product_mrp = product_mrp
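
An empty src usually points to lazy-loaded images: the real URL sits in a fallback attribute such as data-src until the site's JavaScript swaps it in, so a plain HTML fetch never sees it in src. A sketch of the fallback lookup; the class and attribute names below are illustrative assumptions, not Flipkart's actual markup, so the page source should be inspected to find the attribute the site really uses:

```python
from bs4 import BeautifulSoup

# Illustrative markup: src stays empty until JS runs; the real URL
# lives in data-src (attribute name is an assumption).
html = '<img class="product-img" src="" data-src="https://img.example.com/p/shoe.jpg">'
soup = BeautifulSoup(html, 'html.parser')

img = soup.find('img')
# Fall through the common lazy-load attributes; an empty src is falsy.
src = img.get('src') or img.get('data-src') or img.get('data-original') or 'N/A'
print(src)
```

If none of the fallback attributes carry the URL, the image is likely injected entirely by script, and a browser-driven tool (e.g. Selenium) is the usual next step.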

Scraping wrong table

Submitted by 你。 on 2021-02-08 10:46:31
Question: I'm trying to get the advanced stats of players onto an Excel sheet, but the table it's scraping is the first one instead of the advanced stats table:

ValueError: Length of passed values is 23, index implies 21

If I try to use the id instead, I get another error about tbody. Also, I get an error about lname = name.split(" ")[1]:

IndexError: list index out of range

I think that has to do with 'Nene' in the list. Is there a way to fix that?

import requests
from bs4 import BeautifulSoup
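
Two separate problems seem to be in play. On sports-stats sites the later tables are often wrapped in HTML comments, so the parser only "sees" the first table; and an unconditional split(" ")[1] crashes on single-word names like 'Nene'. A hedged sketch of both fixes, with illustrative markup:

```python
from bs4 import BeautifulSoup, Comment

# Illustrative page: the advanced table is hidden inside an HTML comment.
html = ('<table id="basic"><tr><td>first</td></tr></table>'
        '<!-- <table id="advanced"><tr><td>adv</td></tr></table> -->')
soup = BeautifulSoup(html, 'html.parser')

# Re-parse each comment's text so tables inside it become findable.
advanced = None
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    advanced = BeautifulSoup(comment, 'html.parser').find('table', id='advanced')
    if advanced:
        break

def split_name(full_name):
    """Split into (first, last), guarding against one-word names like 'Nene'."""
    parts = full_name.split(' ')
    return parts[0], parts[1] if len(parts) > 1 else ''

print(advanced.td.get_text())  # adv
print(split_name('Nene'))      # ('Nene', '')
```

The Comment re-parse is the standard workaround for comment-hidden tables; the guarded split simply makes the missing surname empty instead of raising IndexError.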

How to scrape many dynamic urls in Python

Submitted by 狂风中的少年 on 2021-02-08 10:30:39
Question: I want to scrape one dynamic URL at a time. What I did is that I scrape the URLs I get from all the hrefs, and then I want to scrape each of those URLs. What I am trying:

from bs4 import BeautifulSoup
import urllib.request
import re

r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")
links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for

BeautifulSoup returns some weird text for the <a> tag

Submitted by ぃ、小莉子 on 2021-02-08 10:25:53
Question: I'm new to web scraping and I'm trying to scrape data from this auction website. However, I ran into this weird problem when trying to get the text of the anchor tag. Here's the HTML:

<div class="mt50">
<div class="head_011">
<a id="item_event_title" href="https://www.storyltd.com/auction/auction.aspx?eid=4158">NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)</a>
</div>
</div>

Here's my code:

auction_info = LTD_work_soup.find('a', id='item_event_title').text
print(auction
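
Parsed directly, the static HTML shown does yield the readable title, so the lookup itself is fine; if the live page still comes back garbled, the site may be serving different (encoded or script-rendered) content to non-browser clients, which is worth checking via the response's declared encoding and raw bytes. A sketch against the HTML from the question:

```python
from bs4 import BeautifulSoup

# The HTML from the question, parsed directly.
html = '''<div class="mt50"><div class="head_011">
<a id="item_event_title" href="https://www.storyltd.com/auction/auction.aspx?eid=4158">
NO RESERVE AUCTION OF MODERN AND CONTEMPORARY ART (16-17 APRIL 2019)</a></div></div>'''
soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) trims the surrounding whitespace and newlines.
auction_info = soup.find('a', id='item_event_title').get_text(strip=True)
print(auction_info)
```

Comparing this output with what the live fetch produces narrows the problem down to either the HTTP response or the parsing step.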

Loop url from dataframe and download pdf files in Python

Submitted by ☆樱花仙子☆ on 2021-02-08 10:16:36
Question: Based on the code from here, I'm able to crawl the URL for each transaction and save them into an Excel file, which can be downloaded here. Now I would like to go further and click the URL link: for each URL, I will need to open and save PDF-format files. How could I do that in Python? Any help would be greatly appreciated. Code for reference:

import shutil
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'xxx'
for page in range(6):
    r = requests
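
A sketch of the download step, assuming each crawled link points directly at a PDF (the URLs below are illustrative; a filename is derived from the URL path and the body is streamed to disk so large files are not held in memory):

```python
import os
from urllib.parse import urlparse

import requests  # third-party: pip install requests

def pdf_filename(url):
    """Derive a local filename from the URL path (illustrative helper)."""
    name = os.path.basename(urlparse(url).path)
    return name if name.lower().endswith('.pdf') else name + '.pdf'

def download_pdf(url, out_dir='pdfs'):
    """Stream one PDF to disk; assumes `url` points directly at a PDF."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, pdf_filename(url))
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return path

print(pdf_filename('https://example.com/files/transaction-01.pdf'))
```

Calling download_pdf for each URL read back out of the Excel file (or kept in the crawl loop's list) completes the pipeline.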
