web-scraping

Scrape dynamic data using scrapy [closed]

只谈情不闲聊 submitted on 2021-01-29 12:18:43
Question: I would like to scrape the option chain of a stock from the Nasdaq website using Scrapy (along with other data). Nasdaq recently updated their website. Here is the URL I am talking about. The data is not loaded with a plain spider, nor in the Scrapy shell. From the Scrapy docs, I

Can't get the fully loaded html for a page using puppeteer

可紊 submitted on 2021-01-29 11:38:43
Question: I'm trying to get the full HTML for this page. It has a spreadsheet that loads slowly. The spreadsheet is included when I take a screenshot of the page, but I can't get its HTML: document.body.outerHTML excludes the HTML for the spreadsheet. It's as if Puppeteer is still seeing the page before the spreadsheet loads. How do I get the fully loaded HTML, including the HTML for the spreadsheet? (async () => { const browser = await puppeteer.launch(); const page

Beautiful Soup returns 'none'

怎甘沉沦 submitted on 2021-01-29 11:24:53
Question: I am using the following code to extract data using Beautiful Soup:
    import requests
    import bs4
    res = requests.get('https://www.jmu.edu/cgi-bin/parking_sign_data.cgi?hash=53616c7465645f5f5c0bbd0eccccb6fe8dd7ed9a0445247e3c7dcb4f91927f7ccc933be780c6e558afb8ebf73620c3e5e3b2c68cd3c138519068eac99d9bf30e1e67ce894deb3a054f95f882da2ea2f0|869835tg89dhkdnbnsv5sg5wg0vmcf4mfcfc2qwm5968unmeh5')
    soup = bs4.BeautifulSoup(res.text, 'xml')
    soup.find_all("span", class_="text")
I've tried different variations of

Missing values while scraping using beautifulsoup in python

假如想象 submitted on 2021-01-29 11:18:23
Question: I'm trying web scraping as my first project in Python (I'm completely new to programming). I'm almost done; however, some values on the web page are missing, so I want to replace each missing value with something like "0" or "Not found". I just want to make a CSV file out of the data; I'm not going further with the analysis. The web page I'm scraping is: https://www.lamudi.com.mx/nuevo-leon/departamento/for-rent/?page=1 I have a loop that collects all of the links of the page,
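One way to handle the missing values described above is to substitute a placeholder at the moment each row is written. The following is only a minimal sketch using the standard library: the helper names `value_or_default` and `rows_to_csv`, the field names, and the sample listings are all hypothetical and do not come from the question.

```python
import csv
import io

MISSING = "Not found"  # placeholder mentioned in the question; "0" would work the same way


def value_or_default(value, default=MISSING):
    """Return the value as a stripped string, or the default if it is absent or blank."""
    if value is None:
        return default
    text = str(value).strip()
    return text if text else default


def rows_to_csv(rows, fields):
    """Render dict rows as CSV text, filling every gap with the placeholder."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow([value_or_default(row.get(name)) for name in fields])
    return buffer.getvalue()


# Hypothetical scraped rows: the second has no price, the third has a blank area.
listings = [
    {"title": "Listing A", "price": "12000", "area": "80"},
    {"title": "Listing B", "area": "65"},
    {"title": "Listing C", "price": "9500", "area": "  "},
]
csv_text = rows_to_csv(listings, ["title", "price", "area"])
print(csv_text)
```

Writing to a real file instead would only require passing a file opened with `newline=""` to `csv.writer` in place of the in-memory buffer.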

Loop through webpages and download all images

与世无争的帅哥 submitted on 2021-01-29 11:12:25
Question: I have a nice URL structure to loop through:
    https://marco.ccr.buffalo.edu/images?page=0&score=Clear
    https://marco.ccr.buffalo.edu/images?page=1&score=Clear
    https://marco.ccr.buffalo.edu/images?page=2&score=Clear
    ...
I want to loop through each of these pages and download the 21 images (JPEG or PNG). I've seen several Beautiful Soup examples, but I'm still struggling to get something that will download multiple images and loop through the URLs. I think I can use urllib to loop through each URL
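The loop described above can be split into two steps: generate the page URLs, then pull image addresses out of each page's HTML. The sketch below uses only the standard library and runs offline against a made-up HTML fragment. The function and class names and the sample markup are hypothetical, and the real pages may structure their images differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

BASE_URL = "https://marco.ccr.buffalo.edu/images"  # taken from the question


def page_urls(page_count, score="Clear"):
    """Build the paginated URLs page=0 .. page=page_count-1."""
    return [
        f"{BASE_URL}?{urlencode({'page': n, 'score': score})}"
        for n in range(page_count)
    ]


class ImageCollector(HTMLParser):
    """Collect absolute URLs of <img> tags whose src ends in .jpg, .jpeg or .png."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Simple extension check; a src with a query string would need extra handling.
        if src and src.lower().endswith((".jpg", ".jpeg", ".png")):
            self.images.append(urljoin(self.page_url, src))


def image_urls(html, page_url):
    """Return the image URLs found in one page of HTML."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    return collector.images


# Made-up fragment standing in for a downloaded page.
sample_html = '<div><img src="/static/a.jpeg"><img src="b.png"><img src="logo.svg"></div>'
urls = page_urls(3)
found = image_urls(sample_html, urls[0])
print(urls)
print(found)
```

In a live run, each page would first be fetched (for example with `urllib.request.urlopen`) and each collected address saved with `urllib.request.urlretrieve`; both steps are left out here so the sketch needs no network access.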

How to retrieve scraped data from the web in a JSON-like format

烈酒焚心 submitted on 2021-01-29 11:11:55
Question: I have tried to scrape my data using jsoup, and I can successfully query all the data that I need from the web. The problem is how to convert my data into a JSON-like format. Example of my data, retrieved using cssQuery:
    Faculty of Engineering
    Computer Science
    Washington
    Understanding algorithm and data structures
    Implement to solve real problem
    Good understanding how computer work
    Mechanical Engineering
    New York
    Understand how machine works
    Can implement the theory to solve real problem
    Faculty of
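The question concerns jsoup (Java), so the following Python sketch only illustrates one possible target shape for the data, not a jsoup solution. It assumes the items nest as faculty, then department, then a city and a list of outcomes. That hierarchy, the level labels, and the key names are guesses and are not stated in the question.

```python
import json

# Flat (level, text) pairs, as a scraper might emit them in document order.
# The level labels are assumptions about what each item represents.
flat = [
    ("faculty", "Faculty of Engineering"),
    ("department", "Computer Science"),
    ("city", "Washington"),
    ("outcome", "Understanding algorithm and data structures"),
    ("outcome", "Implement to solve real problem"),
    ("outcome", "Good understanding how computer work"),
    ("department", "Mechanical Engineering"),
    ("city", "New York"),
    ("outcome", "Understand how machine works"),
    ("outcome", "Can implement the theory to solve real problem"),
]


def nest(items):
    """Group flat items under the most recently seen faculty and department."""
    faculties = []
    for level, text in items:
        if level == "faculty":
            faculties.append({"faculty": text, "departments": []})
        elif level == "department":
            faculties[-1]["departments"].append(
                {"name": text, "city": None, "outcomes": []}
            )
        elif level == "city":
            faculties[-1]["departments"][-1]["city"] = text
        elif level == "outcome":
            faculties[-1]["departments"][-1]["outcomes"].append(text)
    return faculties


result = nest(flat)
print(json.dumps(result, indent=2))
```

The same grouping logic carries over to Java: iterate over the selected elements in order and attach each one to the most recent parent object before serializing.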

Beautiful Soup find() returns None?

我的未来我决定 submitted on 2021-01-29 11:00:40
Question: I am trying to parse the HTML on this website. I would like to get the text from all the span elements with class="post-subject". Examples:
    <span class="post-subject">Set of 20 moving boxes (20009 or 20011)</span>
    <span class="post-subject">Firestick/Old xbox games</span>
When I run my code below, soup.find() returns None. I'm not sure what's going on.
    import requests
    from bs4 import BeautifulSoup
    page = requests.get('https://trashnothing.com/washington-dc-freecycle?page=1')
    soup =
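As a point of comparison, the sketch below extracts the `post-subject` spans from the two example elements quoted in the question, using only the standard library's `html.parser` rather than Beautiful Soup. It shows that the lookup itself is straightforward when the spans are present in the HTML that was received. If a lookup comes back empty on the live page, one possible cause (an assumption, not confirmed by the question) is that the downloaded HTML does not contain those spans at all. The class name `PostSubjectParser` is hypothetical.

```python
from html.parser import HTMLParser


class PostSubjectParser(HTMLParser):
    """Collect the text of every <span> whose class list includes "post-subject".

    Assumes the target spans contain no nested <span> elements.
    """

    def __init__(self):
        super().__init__()
        self.subjects = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "post-subject" in classes:
            self._inside = True
            self.subjects.append("")

    def handle_data(self, data):
        if self._inside:
            self.subjects[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False


# The two example elements quoted in the question.
sample_html = (
    '<span class="post-subject">Set of 20 moving boxes (20009 or 20011)</span>'
    '<span class="post-subject">Firestick/Old xbox games</span>'
)

parser = PostSubjectParser()
parser.feed(sample_html)
subjects = parser.subjects
print(subjects)
```

Running the same parser over the text of the actual response is a quick way to check whether the spans are present before looking for a bug in the selector.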

Cannot scrape website using Selenium in Python

帅比萌擦擦* submitted on 2021-01-29 10:57:27
Question: I'm trying to scrape some data from footlocker.com for an academic project, but I get an error when I try to open the page using Selenium:
    from selenium import webdriver
    driver = webdriver.Chrome("/Users/rushabhnahar/Downloads/chromedriver")
    driver.get("https://www.footlocker.com/adidas-Originals/Shoes/_-_/N- zrZrj?cm_REF=Shoes&crumbs=991")
It opens the browser and the respective page, but the page shows an error saying 'We are sorry'. Any help will be appreciated. Source: https://stackoverflow.com

Need help scraping images from a slideshow with bs4 & python

眉间皱痕 submitted on 2021-01-29 10:36:48
Question: I'm trying to scrape listing information from Craigslist. Unfortunately, I can't seem to get the images, since they are in a slideshow.
    import requests
    from bs4 import BeautifulSoup as soup
    url = "https://newyork.craigslist.org/search/sss"
    r = requests.get(url)
    souped = soup(r.content, 'lxml')
Since the images aren't even in the HTML file requested, do I need to somehow load the page dynamically? If so, can I keep it in Python only? I don't want any other dependencies. Thanks in advance

VBA scraping with JavaScript - trouble using execScript?

ⅰ亾dé卋堺 submitted on 2021-01-29 10:30:56
Question: I need to create a table with the earnings results. I was able to manage the first page, but I need to complete the table with all the previous earnings as well. I don't know how to "click" the "Mostrar Mas" button using execScript, get the result, and keep doing so until the table is complete. This is the code I've put together so far.
    Sub fundamentals()
        ' First page of earnings
        Dim XMLReq As New MSXML2.XMLHTTP60
        Dim HTMLDoc As New MSHTML.HTMLDocument
        Dim TRs As MSHTML