web-scraping

How to retrieve a sub-string from a string that changes dynamically with respect to multiple delimiters through Selenium in Python

谁说我不能喝 submitted on 2021-01-28 08:50:43
Question: Is it possible to remove part of a scraped string, for example turning "Wujek Drew / Uncle Drew" into "Uncle Drew"? Because this is web scraping, the titles will be different every time, so what can I do to get the result above?
Update: I forgot to mention something else that needs to be removed. Given "Wujek Drew / Uncle Drew (2018)", I also need to delete the data at the end of the string.
Answer 1: To remove the first part of the scraped string, separated by the / character, you can use the following solution: value =
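The answer above is cut off, so the following is only a minimal sketch of one way to handle both delimiters, not the answerer's original solution. It assumes the scraped text (for example the `.text` of a Selenium element) is already available as a Python string; the function name `extract_english_title` is hypothetical.

```python
import re


def extract_english_title(raw: str) -> str:
    """Keep the text after the last '/' and drop a trailing '(YYYY)'."""
    # Everything after the last slash; the whole string if there is no slash.
    title = raw.rsplit("/", 1)[-1]
    # Remove a parenthesised four-digit year at the end, such as "(2018)".
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    return title.strip()


print(extract_english_title("Wujek Drew / Uncle Drew (2018)"))  # Uncle Drew
```

Splitting from the right means a title without a slash passes through unchanged, which may matter when only some scraped titles carry a translated name.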

Extract date from multiple webpages with Python

﹥>﹥吖頭↗ submitted on 2021-01-28 08:35:08
Question: I want to extract the date on which a news article was published on a website. For some websites I know the exact HTML element that holds the date/time (div, p, time), but for others I do not. These are links to some of the websites (German websites): (3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226 (Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=

Web Scraping interactive map (javascript) with R and PhantomJS

你离开我真会死。 submitted on 2021-01-28 08:10:29
Question: I am trying to scrape data from an interactive map (I want crime data for a county). I am using R (rvest) and trying to use PhantomJS too. I'm new to web scraping, so I don't yet understand how all the elements work together. The problem, I believe, is that after I run PhantomJS and read the HTML with R's rvest package, I end up with more scripts and no clear data in the HTML. My code is below.
writeLines("var url = 'http://www.google.com';

HTML table data can not be extracted

大城市里の小女人 submitted on 2021-01-28 08:02:34
Question: I am trying to pull some data from an HTML table, but I keep getting error 13, "Type Mismatch". It may be because I am using the wrong HTML tag. I have been struggling with this for a few days now, as I cannot work out the correct process. This is the HTML that I am using. The code:
Dim htmlTable As Object
Dim collTD As Collection
Dim oNode As Object
Dim IE As Object
Set IE = CreateObject("InternetExplorer.application")
With IE
    .Visible = True
    .navigate "https://ukonlinestores.co

I get InvalidURL: URL can't contain control characters when I try to send a request using urllib

隐身守侯 submitted on 2021-01-28 07:48:06
Question: I am trying to get a JSON response from the link passed as a parameter to a urllib request, but it gives me an error saying that the URL can't contain control characters. How can I solve this issue?
start_url = "https://devbusiness.un.org/solr-sitesearch-output/10//0/ds_field_last_updated/desc?bundle_fq =procurement_notice&sm_vid_Institutions_fq=&sm_vid_Procurement_Type_fq=&sm_vid_Countries_fq=&sm_vid_Sectors_fq= &sm_vid_Languages_fq=English&sm_vid_Notice_Type_fq=&deadline_multifield_fq=&ts_field_project
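No answer is shown for this question. One likely cause is the whitespace visible inside the URL (for example after `bundle_fq` and `sm_vid_Sectors_fq=`), since Python's HTTP client rejects request URLs that contain spaces or control characters. The sketch below assumes that whitespace is accidental, such as from wrapping the string across lines, and simply removes it. The function name `strip_url_whitespace` and the example URL are hypothetical.

```python
import re


def strip_url_whitespace(url: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from a URL."""
    return re.sub(r"\s+", "", url)


# Hypothetical URL with stray spaces and a trailing newline.
dirty = "https://example.com/search?bundle_fq =notice&sectors_fq= &lang=English\n"
print(strip_url_whitespace(dirty))
```

If a space is a meaningful part of a parameter value, percent-encoding it (for instance with `urllib.parse.quote`) would be the appropriate fix instead of deleting it.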

scrape website using nodejs cheerio deep nested element tags

社会主义新天地 submitted on 2021-01-28 06:50:12
Question: I'm trying to scrape text from a website but can't seem to extract anything. The structure and code are below. My code:
const rp = require("request-promise");
const $ = require("cheerio");
const url = "xx";
rp(url)
  .then(function(html) {
    //success!
    let token = "ce-bodytext";
    console.log($(token, response).length);
    console.log($(token, html)).text;
  })
  .catch(function(err) {
    console.log(JSON.stringify(err));
  });
While I just need the text, there was no id on the tag. Also, I was hoping ce

Loop pages and save contents in Excel file from website in Python

…衆ロ難τιáo~ submitted on 2021-01-28 06:14:27
Question: I'm trying to loop over the pages from this link and extract the part I'm interested in. Please see the contents in the red circle in the image below. Here's what I've tried:
url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
The XPath for each element (might be helpful for those who don't read Chinese): /html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】 /html/body/div[3]

Problem with __VIEWSTATE, __EVENTVALIDATION, __EVENTTARGET and scrapy & splash

别说谁变了你拦得住时间么 submitted on 2021-01-28 06:04:35
Question: How do I handle __VIEWSTATE, __EVENTVALIDATION, and __EVENTTARGET with scrapy/splash? I tried:
return FormRequest.from_response(response, [...] '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
But this does not work.
Answer 1: You'll need to use a dict as the formdata keyword argument. (I'd also recommend extracting the values into variables first for readability.)
def parse(self, response):
    vs = response.css('input#__VIEWSTATE::attr(value)').extract_first()
    ev = # another extraction
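The answer is truncated, so the sketch below only illustrates the point it makes: the hidden ASP.NET fields go into a plain dict. To stay runnable without Scrapy it collects the fields from an HTML string using the standard library. `HiddenFieldParser`, `build_formdata`, the sample HTML, and the event target `ctl00$next` are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without a value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_formdata(html: str, event_target: str) -> dict:
    """Return a dict of the hidden fields plus the chosen __EVENTTARGET."""
    parser = HiddenFieldParser()
    parser.feed(html)
    formdata = dict(parser.fields)
    formdata["__EVENTTARGET"] = event_target
    return formdata


# Hypothetical page fragment standing in for response.text.
sample = (
    '<form><input type="hidden" name="__VIEWSTATE" value="abc" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz">'
    '<input type="text" name="q" value="ignored"></form>'
)
print(build_formdata(sample, "ctl00$next"))
```

In a Scrapy spider, a dict like this would be passed as `FormRequest.from_response(response, formdata=formdata, callback=...)`; the values could equally come from `response.css(...)` selectors as in the answer.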

Unable to login with Puppeteer

半腔热情 submitted on 2021-01-28 05:59:11
Question: I am trying to log in to Moz at https://moz.com/login with Puppeteer using the following code:
const puppeteer = require('puppeteer');
const creds = { email: "myemail", password: "mypassword" };
(async () => {
  const browser = await puppeteer.launch({ args: [ '--disable-web-security', ], headless: false });
  const page = await browser.newPage();
  await page.goto("https://moz.com/login");
  await page.$eval("input[name=email]", (el, value) => el.value = value, creds.email);
  await page.$eval("input

Html Agility Pack how to get dynamically generated content after page loads

こ雲淡風輕ζ submitted on 2021-01-28 05:50:45
Question: I am attempting to get information from "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys", specifically the div with class "c-ProductList row ss-targeted", but no information seems to be retrieved. Any clues?
var test = page.DocumentNode.SelectNodes("//div[@class='c-ProductList row ss-targeted']");
Answer 1: The content you want is generated after the page loads, using JavaScript and Ajax. HAP cannot get it unless it runs a browser in the background and executes the scripts on the page. .Net Core 2.0