web-scraping | 易学教程

How to retrieve a sub-string from a string that changes dynamically with respect to multiple delimiters through Selenium in Python

Question: I wonder if it's possible to remove part of a scraped string, for example to turn "Wujek Drew / Uncle Drew" into "Uncle Drew". Because this is web scraping, the titles will be different every time, so what can I do to get the result above?
Update: I forgot to mention something else that also needs to be removed. Given "Wujek Drew / Uncle Drew (2018)", I also need to delete the date at the end of the string.
Answer 1: To remove the first part of the scraped string, which is separated by the / character, you can use the following solution: value =
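The answer excerpt above is cut off, so the following is only a minimal sketch of one way to do what the question describes, not the answerer's actual solution. It assumes the scraped text is already available as a plain string (with Selenium this would typically be an element's text), that the wanted title follows the last "/", and that the trailing date is a four-digit year in parentheses. The function name english_title is hypothetical.

```python
import re


def english_title(raw: str) -> str:
    """Return the text after the last '/' with a trailing '(YYYY)' removed."""
    # Keep only the part after the last '/' (the whole string if there is no '/').
    title = raw.rsplit("/", 1)[-1]
    # Assumption: the trailing date is a four-digit year in parentheses, e.g. "(2018)".
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    return title.strip()


print(english_title("Wujek Drew / Uncle Drew (2018)"))  # Uncle Drew
```

Splitting from the right keeps the sketch working when the title has no "/" at all, in which case the input comes back unchanged apart from the year.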

Extract date from multiple webpages with Python

Question: I want to extract the date on which a news article was published on a website. For some websites I know the exact HTML element that holds the date/time (div, p, time), but for others I do not. These are the links for some of the websites (German websites): (3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226 (Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=

Web Scraping interactive map (javascript) with R and PhantomJS

Question: I am trying to scrape data from an interactive map (I want to get crime data for a county). I am using R (rvest) and trying to use PhantomJS too. I'm new to web scraping, so I don't yet understand how all the elements work together. The problem, I believe, is that after I run PhantomJS and load the HTML with R's rvest package, I end up with more scripts and no clear data in the HTML. My code is below. writeLines("var url = 'http://www.google.com';

HTML table data can not be extracted

Question: I am trying to pull some data from an HTML table, but I keep getting error 13, "Type Mismatch". It may be because I am using the wrong HTML tag. I have been struggling with this for a few days now, as I cannot work out the correct process. This is the HTML that I am using. The code:
    Dim htmlTable As Object
    Dim collTD As Collection
    Dim oNode As Object
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.application")
    With IE
        .Visible = True
        .navigate "https://ukonlinestores.co

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

Question: I am trying to get a JSON response from the link passed as a parameter to the urllib request, but it raises an error saying that the URL can't contain control characters. How can I solve this issue? start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project
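The URL in the question contains literal spaces (for example after "bundle_fq" and after "sm_vid_Sectors_fq="), and recent Python versions reject URLs that contain spaces or control characters. A minimal sketch of cleaning such a URL before passing it to urllib follows. It assumes the whitespace is accidental (say, left over from wrapping a long string); the helper name sanitize_url and the example.org URL are hypothetical and stand in for the truncated one above.

```python
from urllib.parse import quote


def sanitize_url(url: str, drop_whitespace: bool = True) -> str:
    """Make a URL safe to hand to urllib.request.

    With drop_whitespace=True (assumption: the whitespace is accidental),
    every whitespace character is removed. Otherwise it is percent-encoded.
    """
    if drop_whitespace:
        # str.split() with no arguments splits on any run of whitespace.
        url = "".join(url.split())
    # Percent-encode remaining unsafe characters; keep URL delimiters and
    # existing %XX escapes ('%' is listed as safe to avoid double encoding).
    return quote(url, safe=":/?&=%#+,;@")


# Hypothetical URL with the same kind of stray spaces as in the question.
raw = "https://example.org/search?bundle_fq =procurement_notice&sectors_fq= &lang=English"
print(sanitize_url(raw))
```

If a space is a meaningful part of a query value, calling sanitize_url(url, drop_whitespace=False) encodes it as %20 instead of removing it. The cleaned string can then be passed to urllib.request.urlopen.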

scrape website using nodejs cheerio deep nested element tags

Question: I'm trying to scrape text from a website but can't seem to extract anything. Below are the structure and the code. My code:
    const rp = require("request-promise");
    const $ = require("cheerio");
    const url = "xx";
    rp(url)
      .then(function(html) {
        //success!
        let token = "ce-bodytext";
        console.log($(token, response).length);
        console.log($(token, html)).text;
      })
      .catch(function(err) {
        console.log(JSON.stringify(err));
      });
While I just need the text, there was no id on the tag. Also, I was hoping ce

Loop pages and save contents in Excel file from website in Python

Question: I'm trying to loop over the pages from this link and extract the interesting part. Please see the contents in the red circle in the image below. Here's what I've tried:
    import requests
    from bs4 import BeautifulSoup

    url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
    for page in range(10):
        r = requests.get(url.format(page))
        soup = BeautifulSoup(r.content, "html.parser")
        print(soup)
XPath for each element (might be helpful for those who don't read Chinese):
    /html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】
    /html/body/div[3]

Problem with __VIEWSTATE, __EVENTVALIDATION, __EVENTTARGET and scrapy & splash

Question: How do I handle __VIEWSTATE, __EVENTVALIDATION, __EVENTTARGET with scrapy/splash? I tried this:
    return FormRequest.from_response(response, [...]
        '__VIEWSTATE': response.css( 'input#__VIEWSTATE::attr(value)').extract_first(),
But this does not work.
Answer 1: You'll need to pass a dict as the formdata keyword argument. (I'd also recommend extracting the values into variables first for readability.)
    def parse(self, response):
        vs = response.css('input#__VIEWSTATE::attr(value)').extract_first()
        ev = # another extraction
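The answer above is truncated, so what follows is only a library-free sketch of the idea it describes: pull the hidden field values out of the page first, then assemble them into a single dict. It uses the standard library's html.parser rather than Scrapy selectors so that it runs on its own. The class HiddenFieldParser, the function build_formdata, the sample HTML, and the event target value "ctl00$next" are all hypothetical. In a Scrapy spider, a dict like this is what would be passed as the formdata keyword argument mentioned in the answer.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_formdata(html: str, event_target: str) -> dict:
    """Return the hidden fields plus the __EVENTTARGET to post back."""
    parser = HiddenFieldParser()
    parser.feed(html)
    formdata = dict(parser.fields)
    formdata["__EVENTTARGET"] = event_target
    return formdata


# Hypothetical ASP.NET-style form used only for illustration.
SAMPLE = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>
"""

print(build_formdata(SAMPLE, "ctl00$next"))
```

Building the dict in one place keeps the extraction step separate from the request, which is the readability point the answer makes.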

Unable to login with Puppeteer

Question: I am trying to log in to Moz at https://moz.com/login with Puppeteer using the following code:
    const puppeteer = require('puppeteer');
    const creds = { email: "myemail", password: "mypassword" };
    (async () => {
      const browser = await puppeteer.launch({
        args: [ '--disable-web-security', ],
        headless: false
      });
      const page = await browser.newPage();
      await page.goto("https://moz.com/login");
      await page.$eval("input[name=email]", (el, value) => el.value = value, creds.email);
      await page.$eval("input

Html Agility Pack how to get dynamically generated content after page loads

Question: I am attempting to get information from "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys", specifically the div with class "c-ProductList row ss-targeted", but no information seems to be retrieved. Any clues?
    var test = page.DocumentNode.SelectNodes("//div[@class='c-ProductList row ss-targeted']");
Answer 1: The content you want is generated after the page loads, using JavaScript and Ajax. HAP cannot get it unless a browser runs in the background and executes the scripts on the page. .NET Core 2.0