web-scraping

How can I export data from DataCamp?

情到浓时终转凉″ submitted on 2021-02-07 09:40:57
Question: I use DataCamp for online learning of R. Sometimes I would like to export the data used in the exercises, but I can't find an easy way to do so. I know that there are instructions for downloading the videos or slides, and some courses provide selected datasets for download in the course description. But how do I download data that I can access through the DataCamp exercise interface for use outside of the platform? Answer 1: A comfortable way to move data from one session to another is the

Scrapy - select xpath with a regular expression

。_饼干妹妹 submitted on 2021-02-07 09:33:43
Question: Part of the HTML that I am scraping looks like this: <h2> <span class="headline" id="Profile">Profile</span></h2> <ul><li> <b>Name</b> Albert Einstein </li><li> <b>Birth Name:</b> Alberto Ein </li><li> <b>Birthdate:</b> December 24, 1986 </li><li> <b>Birthplace:</b> <a href="/Ulm" title="Dest">Ulm</a>, Germany </li><li> <b>Height:</b> 178cm </li><li> <b>Blood Type:</b> A </li></ul> I want to extract each component - name, birth name, birthdate, etc. To extract the name I do: a_name =
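Scrapy selectors expose `.re()` and `.re_first()`, which run a regular expression over the text matched by an XPath/CSS query. Below is a minimal, stdlib-only sketch of the same label/value extraction, using plain `re` in place of a Scrapy selector; the helper name `field` is made up for illustration:

```python
import re

html = """<ul><li> <b>Name</b> Albert Einstein </li><li>
<b>Birth Name:</b> Alberto Ein </li><li>
<b>Birthdate:</b> December 24, 1986 </li></ul>"""

def field(label, text):
    # Capture whatever sits between the label's closing </b> and the </li>,
    # tolerating an optional trailing colon inside the <b> tag.
    pattern = r'<b>%s:?</b>\s*(.*?)\s*</li>' % re.escape(label)
    m = re.search(pattern, text, re.S)
    return m.group(1) if m else None

print(field('Name', html))       # Albert Einstein
print(field('Birthdate', html))  # December 24, 1986
```

In real Scrapy code the equivalent would be applying a regex via the selector's own `.re_first()` rather than matching the raw markup, which is less brittle against whitespace changes.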

Scraping an AJAX page using VBA

北战南征 submitted on 2021-02-06 13:59:49
Question: I've been trying to scrape the entire HTML body and assign it to a string variable before manipulating that string to populate an Excel file - this will be done in a loop to update the data at 5-minute intervals. These pages are AJAX pages, so they run what looks like JavaScript (I'm not familiar with JS at all, though). I've tried using the XMLHttpRequest object (code below), but it returns the JS calls: Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP") XMLHTTP.Open "GET", "https://www.google

CSS Selector to get the element attribute value

此生再无相见时 submitted on 2021-02-06 10:50:45
Question: The HTML structure is like this: <td class='hey'> <a href="https://example.com">First one</a> </td> This is my selector: m_URL = sel.css("td.hey a:nth-child(1)[href] ").extract() My selector currently outputs <a href="https://example.com">First one</a>, but I only want the link itself: https://example.com. How can I do that? Answer 1: Get the ::attr(value) from the a tag. Demo (using Scrapy shell): $ scrapy shell index.html >>> response.css('td.hey a:nth-child(1)::attr(href)').extract
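Scrapy's `::attr(href)` pseudo-element does exactly this: it selects the attribute value instead of the serialized element. For comparison, the same extraction with nothing but the standard library's `html.parser` (the class name `HrefGrabber` is made up for illustration):

```python
from html.parser import HTMLParser

class HrefGrabber(HTMLParser):
    """Collect href attributes from <a> tags inside <td class="hey">."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'td' and attrs.get('class') == 'hey':
            self.in_td = True
        elif tag == 'a' and self.in_td and 'href' in attrs:
            self.hrefs.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

parser = HrefGrabber()
parser.feed('<td class="hey"> <a href="https://example.com">First one</a> </td>')
print(parser.hrefs)  # ['https://example.com']
```

The Scrapy one-liner is far shorter, but the stdlib version shows what the `::attr()` selector is doing under the hood: reading one attribute off the matched start tag.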

Puppeteer: how to download entire web page for offline use

断了今生、忘了曾经 submitted on 2021-02-06 09:07:04
Question: How would I scrape an entire website, with all of its CSS/JavaScript/media intact (and not just its HTML), with Google's Puppeteer? After successfully trying it out on other scraping jobs, I would imagine it should be able to do this. However, looking through the many excellent examples online, there is no obvious method for doing so. The closest I have found is calling html_contents = await page.content() and saving the result, but that saves a copy without any non-HTML elements. Is

Crawling multiple URL in a loop using puppeteer

对着背影说爱祢 submitted on 2021-02-05 21:34:44
Question: I have urls = ['url','url','url'...] and this is what I'm doing: urls.map(async (url)=>{ await page.goto(url); await page.waitForNavigation({ waitUntil: 'networkidle' }); }) This seems not to wait for the page to load and visits all the URLs quite rapidly (I even tried using page.waitFor). I just wanted to know: am I doing something fundamentally wrong, or is this type of functionality not advised/supported? Answer 1: map, forEach, reduce, etc., do not wait for the asynchronous operation within them before they
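The truncated answer is pointing at the standard fix: `Array.prototype.map` fires every async callback immediately, so the navigations race each other, whereas a plain `for...of` loop with `await` visits one URL at a time. The same sequencing idea can be sketched in Python's asyncio; the `visit` coroutine here is a made-up stand-in for `page.goto`:

```python
import asyncio

async def visit(url):
    # Stand-in for page.goto(url): yield control as a real page load would.
    await asyncio.sleep(0)
    return url

async def main(urls):
    results = []
    # A plain for-loop awaits each "navigation" before starting the next,
    # unlike mapping the coroutine over the list and firing them all at once.
    for url in urls:
        results.append(await visit(url))
    return results

print(asyncio.run(main(['a', 'b', 'c'])))  # ['a', 'b', 'c']
```

If concurrent crawling is actually wanted, the deliberate equivalent is gathering the tasks (in JS, `Promise.all` over the mapped array), but each Puppeteer page can only navigate to one URL at a time, so sequential awaiting is usually what the question needs.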

For loop doesn't work for web scraping Google search in python

混江龙づ霸主 submitted on 2021-02-05 12:21:26
Question: I'm web-scraping Google search results with a list of keywords. The nested for loop that scrapes a single page works well. However, the outer for loop over the keyword list does not work as I intended: it should collect the data for each keyword's search results, but the results didn't include the first two keywords, only the outcome of the last keyword. Here is the code: browser = webdriver.Chrome(r"C:\...\chromedriver.exe") df = pd.DataFrame(columns = ['ceo', 'value'
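The code is cut off, so the exact cause can't be confirmed, but the symptom described (only the last keyword's results survive) usually means each loop iteration rebinds the results container instead of extending it. A minimal sketch of that fix, with a made-up `scrape_keyword` standing in for the Selenium scraping step:

```python
# Hypothetical stand-in for the Selenium page scrape: returns one row per hit.
def scrape_keyword(keyword):
    return [{'ceo': keyword + '_ceo', 'value': 1}]

keywords = ['apple', 'google', 'amazon']
rows = []
for kw in keywords:
    # Extend the shared list each iteration instead of assigning `rows = ...`,
    # so earlier keywords' results are not overwritten by the last one.
    rows.extend(scrape_keyword(kw))

print(len(rows))  # 3
```

In the original code the accumulated rows would then be loaded into the DataFrame once, after the loop, rather than recreating the DataFrame inside it.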

Scraping data with vba from Yahoo finance

梦想的初衷 submitted on 2021-02-05 12:20:32
Question: I need to read the closing price of a stock from the Yahoo Finance page. I had this answered before using the Google Finance page, but that page is no longer available and I believe Google has completely changed its Finance pages. I believe I can apply the same approach to Yahoo Finance with little modification. Let's say Yahoo Finance has the following code for the stock symbol AAPL (Apple): ![YAHOO WEBPAGE CODE FOR AAPL][1] I need to extract only the value 172.77. This was working perfectly with Google
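Whatever the original VBA parsing looked like, the core step is pulling one numeric attribute out of the downloaded quote markup. Here is a stdlib Python sketch against a simplified, hypothetical snippet of the quote element; Yahoo's real markup changes frequently, so the tag and attribute names below are illustrative only, not Yahoo's actual page structure:

```python
import re

# Hypothetical, simplified stand-in for the Yahoo Finance quote markup.
html = ('<fin-streamer data-symbol="AAPL" data-field="regularMarketPrice" '
        'value="172.77">172.77</fin-streamer>')

# Pull the numeric value attribute off the price element.
m = re.search(r'data-field="regularMarketPrice"[^>]*value="([\d.]+)"', html)
price = float(m.group(1)) if m else None
print(price)  # 172.77
```

Because this kind of markup-dependent scraping breaks whenever the page changes, the more robust route for closing prices is usually a quote API or CSV download endpoint rather than parsing the rendered HTML.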