web-scraping

How can I export data from DataCamp?

情到浓时终转凉″ submitted on 2021-02-07 09:40:57
Question: I use DataCamp for online learning of R. Sometimes I would like to export the data used in the exercises, but I can't find an easy way to do so. I know that there are instructions for downloading the videos or slides, and some courses provide selected datasets for download in the course description. But how do I download data that I can access through the DataCamp exercise interface for use outside of the platform? Answer 1: A comfortable way to move data from one session to another is the

Scrapy - select xpath with a regular expression

。_饼干妹妹 submitted on 2021-02-07 09:33:43
Question: Part of the HTML that I am scraping looks like this: <h2> <span class="headline" id="Profile">Profile</span></h2> <ul><li> <b>Name</b> Albert Einstein </li><li> <b>Birth Name:</b> Alberto Ein </li><li> <b>Birthdate:</b> December 24, 1986 </li><li> <b>Birthplace:</b> <a href="/Ulm" title="Dest">Ulm</a>, Germany </li><li> <b>Height:</b> 178cm </li><li> <b>Blood Type:</b> A </li></ul> I want to extract each component - name, birth name, birthdate, etc. To extract the name I do: a_name =
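Scrapy selectors expose `.re()` and `.re_first()`, which run a regular expression over the text matched by an XPath/CSS query. Below is a minimal, stdlib-only sketch of the same label/value extraction, using plain `re` in place of a Scrapy selector; the helper name `field` is made up for illustration:

```python
import re

html = """<ul><li> <b>Name</b> Albert Einstein </li><li>
<b>Birth Name:</b> Alberto Ein </li><li>
<b>Birthdate:</b> December 24, 1986 </li></ul>"""

def field(label, text):
    # Capture whatever sits between the label's closing </b> and the </li>,
    # tolerating an optional trailing colon inside the <b> tag.
    pattern = r'<b>%s:?</b>\s*(.*?)\s*</li>' % re.escape(label)
    m = re.search(pattern, text, re.S)
    return m.group(1) if m else None

print(field('Name', html))       # Albert Einstein
print(field('Birthdate', html))  # December 24, 1986
```

In real Scrapy code the equivalent would be applying a regex via the selector's own `.re_first()` rather than matching the raw markup, which is less brittle against whitespace changes.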

Scraping an AJAX page using VBA

北战南征 submitted on 2021-02-06 13:59:49
Question: I've been trying to scrape the entire HTML body and assign it to a string variable before manipulating that string to populate an Excel file - this will be done in a loop to update the data at 5-minute intervals. These pages are AJAX pages, so they run what looks like JavaScript (I'm not familiar with JS at all, though). I've tried using the XMLHttpRequest object (code below), but it returns the JS calls: Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP") XMLHTTP.Open "GET", "https://www.google

CSS Selector to get the element attribute value

此生再无相见时 submitted on 2021-02-06 10:50:45
Question: The HTML structure is like this: <td class='hey'> <a href="https://example.com">First one</a> </td> This is my selector: m_URL = sel.css("td.hey a:nth-child(1)[href] ").extract() My selector currently outputs <a href="https://example.com">First one</a>, but I only want the link itself: https://example.com. How can I do that? Answer 1: Get the ::attr(value) from the a tag. Demo (using Scrapy shell): $ scrapy shell index.html >>> response.css('td.hey a:nth-child(1)::attr(href)').extract
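Scrapy's `::attr(href)` pseudo-element does exactly this: it selects the attribute value instead of the serialized element. For comparison, the same extraction with nothing but the standard library's `html.parser` (the class name `HrefGrabber` is made up for illustration):

```python
from html.parser import HTMLParser

class HrefGrabber(HTMLParser):
    """Collect href attributes from <a> tags inside <td class="hey">."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'hey':
            self.in_td = True
        elif tag == 'a' and self.in_td and 'href' in attrs:
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

parser = HrefGrabber()
parser.feed('<td class="hey"> <a href="https://example.com">First one</a> </td>')
print(parser.hrefs)  # ['https://example.com']
```

The Scrapy one-liner is far shorter, but the stdlib version shows what the `::attr()` selector is doing under the hood: reading one attribute off the matched start tag.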

Puppeteer: how to download entire web page for offline use

断了今生、忘了曾经 submitted on 2021-02-06 09:07:04
Question: How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to do this. However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have found is calling html_contents = await page.content() and saving the result, but that saves a copy without any non-HTML elements. Is

Crawling multiple URL in a loop using puppeteer

对着背影说爱祢 submitted on 2021-02-05 21:34:44
Question: I have urls = ['url','url','url'...] and this is what I'm doing: urls.map(async (url)=>{ await page.goto(url); await page.waitForNavigation({ waitUntil: 'networkidle' }); }) This seems not to wait for the page to load and visits all the URLs quite rapidly (I even tried using page.waitFor). I just wanted to know: am I doing something fundamentally wrong, or is this type of functionality not advised/supported? Answer 1: map, forEach, reduce, etc., do not wait for the asynchronous operation within them before they
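The truncated answer is pointing at the standard fix: `Array.prototype.map` fires every async callback immediately, so the navigations race each other, whereas a plain `for...of` loop with `await` visits one URL at a time. The same sequencing idea can be sketched in Python's asyncio; the `visit` coroutine here is a made-up stand-in for `page.goto`:

```python
import asyncio

async def visit(url):
    # Stand-in for page.goto(url): yield control as a real page load would.
    await asyncio.sleep(0)
    return url

async def main(urls):
    results = []
    # A plain for-loop awaits each "navigation" before starting the next,
    # unlike mapping the coroutine over the list and firing them all at once.
    for url in urls:
        results.append(await visit(url))
    return results

print(asyncio.run(main(['a', 'b', 'c'])))  # ['a', 'b', 'c']
```

If concurrent crawling is actually wanted, the deliberate equivalent is gathering the tasks (in JS, `Promise.all` over the mapped array), but each Puppeteer page can only navigate to one URL at a time, so sequential awaiting is usually what the question needs.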

For loop doesn't work for web scraping Google search in python

混江龙づ霸主 submitted on 2021-02-05 12:21:26
Question: I'm web-scraping Google search results with a list of keywords. The nested for loop that scrapes a single page works well. However, the outer for loop over the keyword list does not work as I intended: it should collect the data for each keyword's search results, but the results didn't include the first two keywords, only the outcome of the last keyword. Here is the code: browser = webdriver.Chrome(r"C:\...\chromedriver.exe") df = pd.DataFrame(columns = ['ceo', 'value'
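The code is cut off, so the exact cause can't be confirmed, but the symptom described (only the last keyword's results survive) usually means each loop iteration rebinds the results container instead of extending it. A minimal sketch of that fix, with a made-up `scrape_keyword` standing in for the Selenium scraping step:

```python
# Hypothetical stand-in for the Selenium page scrape: returns one row per hit.
def scrape_keyword(keyword):
    return [{'ceo': keyword + '_ceo', 'value': 1}]

keywords = ['apple', 'google', 'amazon']
rows = []
for kw in keywords:
    # Extend the shared list each iteration instead of assigning `rows = ...`,
    # so earlier keywords' results are not overwritten by the last one.
    rows.extend(scrape_keyword(kw))

print(len(rows))  # 3
```

In the original code the accumulated rows would then be loaded into the DataFrame once, after the loop, rather than recreating the DataFrame inside it.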

Scraping data with vba from Yahoo finance

梦想的初衷 submitted on 2021-02-05 12:20:32
Question: I need to read the closing price of a stock from the Yahoo Finance page. I had this answered before using the Google Finance page, but that page is no longer available and I believe Google has completely changed its Finance pages. I believe I can apply the same approach to Yahoo Finance with little modification. Let's say Yahoo Finance has the following code for the stock symbol AAPL (Apple): ![YAHOO WEBPAGE CODE FOR AAPL][1] I need to extract only the value 172.77. This was working perfectly with Google
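Whatever the original VBA parsing looked like, the core step is pulling one numeric attribute out of the downloaded quote markup. Here is a stdlib Python sketch against a simplified, hypothetical snippet of the quote element; Yahoo's real markup changes frequently, so the tag and attribute names below are illustrative only, not Yahoo's actual page structure:

```python
import re

# Hypothetical, simplified stand-in for the Yahoo Finance quote markup.
html = ('<fin-streamer data-symbol="AAPL" data-field="regularMarketPrice" '
        'value="172.77">172.77</fin-streamer>')

# Pull the numeric value attribute off the price element.
m = re.search(r'data-field="regularMarketPrice"[^>]*value="([\d.]+)"', html)
price = float(m.group(1)) if m else None
print(price)  # 172.77
```

Because this kind of markup-dependent scraping breaks whenever the page changes, the more robust route for closing prices is usually a quote API or CSV download endpoint rather than parsing the rendered HTML.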