beautifulsoup

Web scraping information off a website using Python requests

坚强是说给别人听的谎言 submitted on 2020-07-22 05:50:29
Question: Web scraping https://www.nike.com/w/mens-shoes-nik1zy7ok for shoes. Right now I can retrieve the shoes that initially load, and also the shoes that load as you scroll to the next page, with the following code:

    import re
    import json
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
    html_data = requests.get(url).text
    data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))
    for p in data['Wall']['products']:
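
A minimal sketch of how that loop might continue. The product field names used here ('title', 'url') are assumptions about Nike's Redux state, not verified keys:

    import re
    import json
    import requests

    url = 'https://www.nike.com/gb/w/womens-shoes-5e1x6zy7ok'
    html_data = requests.get(url).text
    data = json.loads(re.search(r'window.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

    for p in data['Wall']['products']:
        # .get() guards against keys that may be absent on some products
        print(p.get('title'), p.get('url'))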

How to scrape data off Morningstar

╄→尐↘猪︶ㄣ submitted on 2020-07-22 05:19:13
Question: So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but I got a bit confused by this graph. After inspecting the element of the piece of data I wanted to extract, I saw this:

    <span id="TSMAIN">: 100.7490637</span>

The problem is, my original idea for scraping the data points would have been to iterate through some sort of id list
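
If the span is present in the static HTML at all (chart values like this are often injected by JavaScript, in which case requests alone will not see them), a minimal extraction sketch could look like the following; the URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com/chart-page'  # placeholder, not the real Morningstar page
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')

    span = soup.find('span', id='TSMAIN')
    if span is not None:
        # the tag's text starts with ': ', so strip those characters first
        value = float(span.get_text().lstrip(': '))
        print(value)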

Extracting particular data from span tags with BeautifulSoup

梦想的初衷 submitted on 2020-07-20 05:42:28
Question: I have this structure:

    <div id="one" class="tab-pane active">
      <div class="item-content">
        <a href="/mobilni-uredjaji.4403.html">
          <div class="item-primary">
            <div class="sticker-small">
              <span class=""></span>
            </div>
            <div class="sticker-small-lte">
              <span class="lte"></span>
            </div>
            <div class="item-photo">
              <img src="/upload/images/thumbs/devices/SAMG935F/SAMG935F_image001_220x230.png" alt="Samsung Galaxy S7 edge">
            </div>
            <div class="item-labels">
              <span class="item-manufacturer">Samsung</span>
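
A minimal sketch for pulling the label spans out of markup like that, assuming the snippet above is representative (it is truncated, so the selectors may need adjusting):

    from bs4 import BeautifulSoup

    html = '''
    <div id="one" class="tab-pane active">
      <div class="item-content">
        <div class="item-labels">
          <span class="item-manufacturer">Samsung</span>
        </div>
      </div>
    </div>'''  # abbreviated stand-in for the structure above

    soup = BeautifulSoup(html, 'html.parser')
    # CSS selector: every manufacturer span inside the #one pane
    for span in soup.select('#one span.item-manufacturer'):
        print(span.get_text(strip=True))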

Use Beautiful Soup to scrape multiple websites

我与影子孤独终老i submitted on 2020-07-19 06:19:30
Question: I want to know why the lists all_links and all_titles don't receive any records from the lists titles and links. I have also tried the .extend() method, and it didn't help.

    import requests
    from bs4 import BeautifulSoup

    all_links = []
    all_titles = []

    def title_link(page_num):
        page = requests.get(
            'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
            % (page_num, page_num, page_num))
        soup = BeautifulSoup(page.content, 'html.parser')
        links = ['https://www
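
A common cause of this symptom is that titles and links are built as locals inside the function and never appended to the outer lists. One way to wire it up is sketched below; the selector is hypothetical, since the original line is cut off above:

    import requests
    from bs4 import BeautifulSoup

    all_links = []
    all_titles = []

    def title_link(page_num):
        url = ('https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/'
               'warszawa/page-%d/v%dc9073l3200008p%d' % (page_num, page_num, page_num))
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        anchors = soup.select('a.href-link')  # hypothetical selector
        links = ['https://www.gumtree.pl' + a.get('href', '') for a in anchors]
        titles = [a.get_text(strip=True) for a in anchors]
        return links, titles

    for n in range(1, 4):
        links, titles = title_link(n)
        all_links.extend(links)    # extend the module-level lists explicitly
        all_titles.extend(titles)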

Scraping a large number of Google Scholar pages by URL

不羁的心 submitted on 2020-07-18 02:52:08
Question: I'm trying to get the full author list of all publications from an author on Google Scholar using BeautifulSoup. Since the author's home page only has a truncated list of authors for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few attempts. Is there a way to avoid the CAPTCHA (e.g. pause for 3 seconds after every request)? Or to make the original Google Scholar profile page show the full author list?

Answer 1: Recently I faced a similar issue. I at
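
Pausing between requests, as the question suggests, is easy to add; a sketch follows. The delay value and header are illustrative, and throttling alone is not guaranteed to avoid Google's CAPTCHA:

    import time
    import requests

    session = requests.Session()
    # the default python-requests User-Agent is often blocked outright
    session.headers.update({'User-Agent': 'Mozilla/5.0'})

    paper_urls = []  # filled from the profile page; gathering them is omitted here
    for url in paper_urls:
        html = session.get(url).text
        # ... parse the full author list out of html with BeautifulSoup ...
        time.sleep(3)  # wait between requests to keep the request rate low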

lxml / BeautifulSoup parser warning

好久不见. submitted on 2020-07-17 10:26:49
Question: Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup, as explained here: http://lxml.de/elementsoup.html Specifically, I want to use lxml, but I'd like to use BeautifulSoup because, like I said, it's ugly HTML and lxml will reject it on its own. The link above says: "All you need to do is pass it to the fromstring() function:"

    from lxml.html.soupparser import fromstring
    root = fromstring(tag_soup)

So that's what I'm doing:

    URL = 'http:/
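
The warning in question is usually BeautifulSoup complaining that no parser was explicitly specified. lxml's soupparser forwards extra keyword arguments to BeautifulSoup, so naming a parser should silence it; a sketch with a placeholder URL:

    import requests
    from lxml.html.soupparser import fromstring

    URL = 'http://example.com/ugly-page'  # placeholder for the truncated URL above
    tag_soup = requests.get(URL).text

    # extra kwargs are passed through to BeautifulSoup; naming a parser
    # avoids the "no parser was explicitly specified" warning
    root = fromstring(tag_soup, features='html.parser')
    print(root.tag)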

How to preserve links when scraping a table with Beautiful Soup and pandas

只谈情不闲聊 submitted on 2020-07-17 08:24:16
Question: I'm scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns has some URLs. When I pass the HTML to pandas, the hrefs are lost. Is there any way of preserving the URL link just for that column? Example data (edited to better suit the real case):

    <html>
    <body>
    <table>
      <tr>
        <td>customer</td>
        <td>country</td>
        <td>area</td>
        <td>website link</td>
      </tr>
      <tr>
        <td>IBM</td>
        <td>USA</td>
        <td>EMEA</td>
        <td><a href="http://www.ibm.com">IBM site</a></td>
      </tr>
      <tr>
        <td>CISCO</td>
        <td>USA</td>
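
One common approach is sketched below: rewrite each anchor's visible text to its href before handing the table to pandas, since read_html keeps only the text of each cell (pd.read_html also needs lxml or html5lib installed):

    from bs4 import BeautifulSoup
    import pandas as pd

    html = '''<table>
      <tr><td>customer</td><td>country</td><td>area</td><td>website link</td></tr>
      <tr><td>IBM</td><td>USA</td><td>EMEA</td>
          <td><a href="http://www.ibm.com">IBM site</a></td></tr>
    </table>'''

    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        a.string = a['href']  # replace the link text with the URL itself

    df = pd.read_html(str(soup), header=0)[0]
    print(df)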

Scraper in Python gives “Access Denied”

南笙酒味 submitted on 2020-07-15 19:24:11
Question: I'm trying to code a scraper in Python to get some info from a page, like the titles of the offers that appear on this page: https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585 For now I use this code:

    import bs4
    import requests

    def extract_source(url):
        source = requests.get(url).text
        return source

    def extract_data(source):
        soup = bs4.BeautifulSoup(source)
        names = soup.findAll('title')
        for i in names:
            print i

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct
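
An "Access Denied" page from a site like this is often the server rejecting the default python-requests User-Agent rather than a parsing problem. A sketch of the same two functions with a browser-like header (there is no guarantee the site accepts it), ported to Python 3:

    import requests
    from bs4 import BeautifulSoup

    def extract_source(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
        return requests.get(url, headers=headers).text

    def extract_data(source):
        soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly
        for tag in soup.find_all('title'):
            print(tag.get_text())

    extract_data(extract_source('https://www.justdial.com/Panipat/Saree-Retailers/nct-10420585'))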