beautifulsoup

Web scraping information off a website using Python requests

坚强是说给别人听的谎言 submitted on 2020-07-22 05:50:29
Question: Web scraping https://www.nike.com/w/mens-shoes-nik1zy7ok for shoes. Right now I can retrieve the shoes that initially load, and also the shoes that load as you scroll to the next page, with the following code:

    import re
    import json
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
    html_data = requests.get(url).text
    data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))
    for p in data['Wall']['products']:
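
A minimal sketch of how that loop might continue. The product field names used here ('title', 'url') are assumptions about Nike's Redux state, not verified keys:

    import re
    import json
    import requests

    url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
    html_data = requests.get(url).text
    data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

    for p in data['Wall']['products']:
        # .get() guards against keys that may be absent on some products
        print(p.get('title'), p.get('url'))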

How to scrape data off Morningstar

╄→尐↘猪︶ㄣ submitted on 2020-07-22 05:19:13
Question: So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but I got a bit confused by this graph. After inspecting the element of the piece of data I wanted to extract, I saw this:

    <span id="TSMAIN">: 100.7490637</span>

The problem is, my original idea for scraping the data points would have been to iterate through some sort of id list
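
If the span is present in the static HTML at all (chart values like this are often injected by JavaScript, in which case requests alone will not see them), a minimal extraction sketch could look like the following; the URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/chart-page'  # placeholder, not the real Morningstar page
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    span = soup.find('span', id='TSMAIN')
    if span is not None:
        # the tag's text starts with ': ', so strip those characters first
        value = float(span.get_text().lstrip(': '))
        print(value)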

Extracting particular data from span tags with BeautifulSoup

梦想的初衷 submitted on 2020-07-20 05:42:28
Question: I have this structure:

    <div id="one" class="tab-pane active">
      <div class="item-content">
        <a href="/mobilni-uredjaji.4403.html">
          <div class="item-primary">
            <div class="sticker-small">
              <span class=""></span>
            </div>
            <div class="sticker-small-lte">
              <span class="lte"></span>
            </div>
            <div class="item-photo">
              <img src="/upload/images/thumbs/devices/SAMG935F/SAMG935F_image001_220x230.png" alt="Samsung Galaxy S7 edge">
            </div>
            <div class="item-labels">
              <span class="item-manufacturer">Samsung</span>
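
A minimal sketch for pulling the label spans out of markup like that, assuming the snippet above is representative (it is truncated, so the selectors may need adjusting):

    from bs4 import BeautifulSoup

    html = '''
    <div id="one" class="tab-pane active">
      <div class="item-content">
        <div class="item-labels">
          <span class="item-manufacturer">Samsung</span>
        </div>
      </div>
    </div>'''  # abbreviated stand-in for the structure above

    soup = BeautifulSoup(html, 'html.parser')
    # CSS selector: every manufacturer span inside the #one pane
    for span in soup.select('#one span.item-manufacturer'):
        print(span.get_text(strip=True))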

Use Beautiful Soup to scrape multiple websites

我与影子孤独终老i submitted on 2020-07-19 06:19:30
Question: I want to know why the lists all_links and all_titles don't receive any records from the lists titles and links. I have also tried the .extend() method, and it didn't help.

    import requests
    from bs4 import BeautifulSoup

    all_links = []
    all_titles = []

    def title_link(page_num):
        page = requests.get(
            'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
            % (page_num, page_num, page_num))
        soup = BeautifulSoup(page.content, 'html.parser')
        links = ['https://www
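
A common cause of this symptom is that titles and links are built as locals inside the function and never appended to the outer lists. One way to wire it up is sketched below; the selector is hypothetical, since the original line is cut off above:

    import requests
    from bs4 import BeautifulSoup

    all_links = []
    all_titles = []

    def title_link(page_num):
        url = ('https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/'
               'warszawa/page-%d/v%dc9073l3200008p%d' % (page_num, page_num, page_num))
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        anchors = soup.select('a.href-link')  # hypothetical selector
        links = ['https://www.gumtree.pl' + a.get('href', '') for a in anchors]
        titles = [a.get_text(strip=True) for a in anchors]
        return links, titles

    for n in range(1, 4):
        links, titles = title_link(n)
        all_links.extend(links)    # extend the module-level lists explicitly
        all_titles.extend(titles)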

Scraping a large number of Google Scholar pages by URL

不羁的心 submitted on 2020-07-18 02:52:08
Question: I'm trying to get the full author list of all publications from an author on Google Scholar using BeautifulSoup. Since the author's home page only has a truncated list of authors for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few attempts. Is there a way to avoid the CAPTCHA (e.g. pause for 3 seconds after every request)? Or to make the original Google Scholar profile page show the full author list?

Answer 1: Recently I faced a similar issue. I at
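
Pausing between requests, as the question suggests, is easy to add; a sketch follows. The delay value and header are illustrative, and throttling alone is not guaranteed to avoid Google's CAPTCHA:

    import time
    import requests

    session = requests.Session()
    # the default python-requests User-Agent is often blocked outright
    session.headers.update({'User-Agent': 'Mozilla/5.0'})

    paper_urls = []  # filled from the profile page; gathering them is omitted here
    for url in paper_urls:
        html = session.get(url).text
        # ... parse the full author list out of html with BeautifulSoup ...
        time.sleep(3)  # wait between requests to keep the request rate low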

lxml / BeautifulSoup parser warning

好久不见. submitted on 2020-07-17 10:26:49
Question: Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup, as explained here: http://lxml.de/elementsoup.html Specifically, I want to use lxml, but I'd like to use BeautifulSoup because, like I said, it's ugly HTML and lxml will reject it on its own. The link above says: "All you need to do is pass it to the fromstring() function:"

    from lxml.html.soupparser import fromstring
    root = fromstring(tag_soup)

So that's what I'm doing:

    URL = 'http:/
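
The warning in question is usually BeautifulSoup complaining that no parser was explicitly specified. lxml's soupparser forwards extra keyword arguments to BeautifulSoup, so naming a parser should silence it; a sketch with a placeholder URL:

    import requests
    from lxml.html.soupparser import fromstring

    URL = 'http://example.com/ugly-page'  # placeholder for the truncated URL above
    tag_soup = requests.get(URL).text

    # extra kwargs are passed through to BeautifulSoup; naming a parser
    # avoids the "no parser was explicitly specified" warning
    root = fromstring(tag_soup, features='html.parser')
    print(root.tag)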

How to preserve links when scraping a table with Beautiful Soup and pandas

只谈情不闲聊 submitted on 2020-07-17 08:24:16
Question: I'm scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns has some URLs. When I pass the HTML to pandas, the hrefs are lost. Is there any way of preserving the URL link just for that column? Example data (edited to better suit the real case):

    <html>
    <body>
    <table>
      <tr>
        <td>customer</td>
        <td>country</td>
        <td>area</td>
        <td>website link</td>
      </tr>
      <tr>
        <td>IBM</td>
        <td>USA</td>
        <td>EMEA</td>
        <td><a href="http://www.ibm.com">IBM site</a></td>
      </tr>
      <tr>
        <td>CISCO</td>
        <td>USA</td>
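
One common approach is sketched below: rewrite each anchor's visible text to its href before handing the table to pandas, since read_html keeps only the text of each cell (pd.read_html also needs lxml or html5lib installed):

    from bs4 import BeautifulSoup
    import pandas as pd

    html = '''<table>
      <tr><td>customer</td><td>country</td><td>area</td><td>website link</td></tr>
      <tr><td>IBM</td><td>USA</td><td>EMEA</td>
          <td><a href="http://www.ibm.com">IBM site</a></td></tr>
    </table>'''

    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        a.string = a['href']  # replace the link text with the URL itself

    df = pd.read_html(str(soup), header=0)[0]
    print(df)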

Scraper in Python gives “Access Denied”

南笙酒味 submitted on 2020-07-15 19:24:11
Question: I'm trying to code a scraper in Python to get some info from a page, like the titles of the offers that appear on this page: https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585 For now I use this code:

    import bs4
    import requests

    def extract_source(url):
        source = requests.get(url).text
        return source

    def extract_data(source):
        soup = bs4.BeautifulSoup(source)
        names = soup.findAll('title')
        for i in names:
            print i

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct
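
An "Access Denied" page from a site like this is often the server rejecting the default python-requests User-Agent rather than a parsing problem. A sketch of the same two functions with a browser-like header (there is no guarantee the site accepts it), ported to Python 3:

    import requests
    from bs4 import BeautifulSoup

    def extract_source(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        return requests.get(url, headers=headers).text

    def extract_data(source):
        soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly
        for tag in soup.find_all('title'):
            print(tag.get_text())

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))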