web-scraping

Puppeteer can't access HTTPS site with proxy server

Submitted by 女生的网名这么多〃 on 2021-01-29 15:42:59
Question: Here is my Node.js code, in which I am trying to access an HTTPS site through an HTTPS proxy, but it doesn't seem to work; the same setup with an HTTP proxy works fine. I have researched this but nothing has worked.

    const puppeteer = require("puppeteer-extra");
    const useProxy = require("puppeteer-page-proxy");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");
    puppeteer.use(StealthPlugin());
    puppeteer.use(AdblockerPlugin({ blockTrackers:
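Before changing the Puppeteer setup, it can help to confirm that the proxy itself can tunnel HTTPS (CONNECT) traffic at all. A quick check, sketched here in Python with requests purely for illustration; the proxy URL is a placeholder to replace with the real host, port and credentials:

    import requests

    proxy = "http://user:password@proxy.example.com:8080"   # placeholder proxy address

    try:
        resp = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        print("Proxy handled HTTPS:", resp.status_code, resp.text)
    except requests.RequestException as exc:
        # If this fails while plain HTTP through the same proxy works, the proxy
        # (or its credentials) does not support HTTPS tunnelling, and no browser
        # setting will fix that.
        print("Proxy failed for HTTPS:", exc)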

Getting table values from nowgoal raises an index error

Submitted by 廉价感情. on 2021-01-29 15:37:20
Question: I am quite new to scraping. I am getting links from nowgoal; below is how I started navigating to the page above. I do not wish to get links for all matches; instead I have an input txt file (attached here) and use the selected league and date from it. The following code initialises the input:

    # Initialisation
    league_index = []
    final_list = []
    j = 0

    # Load config
    config = RawConfigParser()
    configFilePath = r'.\config.txt'
    config.read(configFilePath)
    date = config.get('database_config', 'date')
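An IndexError in this kind of loop usually means the code indexes a row or league that is not always present. A minimal, self-contained sketch of a safer pattern; the 'league' config key and the row structure are assumptions made for illustration:

    from configparser import RawConfigParser

    # Example config in the same shape as the question's config.txt;
    # the 'league' key name is assumed.
    config = RawConfigParser()
    config.read_string("""
    [database_config]
    date = 2021-01-29
    league = ENG Premier League
    """)
    date = config.get('database_config', 'date')
    league = config.get('database_config', 'league')

    # rows would normally come from the scraped nowgoal table as (league, link) pairs
    rows = [("ENG Premier League", "/match/123"), ("ESP La Liga", "/match/456")]

    # Filter defensively instead of indexing a position that may not exist,
    # which is the usual cause of the IndexError.
    links = [link for (row_league, link) in rows if row_league == league]
    print(links or f"No matches found for {league} on {date}")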

How to scrape dynamic table data

Submitted by ぐ巨炮叔叔 on 2021-01-29 14:11:04
Question: I want to scrape the table data from http://5000best.com/websites/. The table content is paginated across several pages and loaded dynamically, and I want to scrape it for each category. I can scrape the table manually for each category, but that is not what I want; please have a look and suggest an approach. I am able to build links for each category, e.g. http://5000best.com/websites/Movies/, http://5000best.com/websites/Games/, etc., but I am not sure how to take it further to
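A common pattern for this kind of site is to loop over the category URLs and page numbers, stopping when a page returns no more rows. A sketch with requests and BeautifulSoup; the page-URL pattern and table markup are assumptions that need to be checked against the requests the site actually makes (browser network tab):

    import requests
    from bs4 import BeautifulSoup

    BASE = "http://5000best.com/websites"
    categories = ["Movies", "Games"]            # extend with the other categories

    for category in categories:
        for page in range(1, 51):               # hard cap as a safety net
            # Assumed pagination pattern; copy the real URL the table widget
            # requests for page 2, 3, ... from the browser's network tab.
            url = f"{BASE}/{category}/?page={page}"
            resp = requests.get(url, timeout=15)
            soup = BeautifulSoup(resp.text, "html.parser")
            rows = soup.select("table tr")      # assumed table markup
            if not rows:
                break                           # no more pages for this category
            for row in rows:
                cells = [td.get_text(strip=True) for td in row.find_all("td")]
                print(category, cells)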

Extract the number of results from a Google search

Submitted by 我的未来我决定 on 2021-01-29 13:56:48
Question: I am writing a web scraper to extract the number of results of a Google search, which appears at the top left of the results page. I have written the code below, but I do not understand why phrase_extract is None. I want to extract the phrase "About 12,010,000,000 results". Which part am I getting wrong? Maybe I am parsing the HTML incorrectly?

    import requests
    from bs4 import BeautifulSoup

    def pyGoogleSearch(word):
        address = 'http://www.google.com/#q='
        newword = address + word
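Two things are worth checking here: the fragment form http://www.google.com/#q=... never reaches the server (everything after # stays in the browser), and Google serves stripped-down markup without a browser-like User-Agent. A sketch of the usual fix; the id of the result-count element has changed over the years, so treat the selector as an assumption and inspect the downloaded HTML if it returns None:

    import requests
    from bs4 import BeautifulSoup

    def google_result_count(word):
        # Use the real search endpoint, not the '#q=' fragment form.
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": word},
            headers={"User-Agent": "Mozilla/5.0"},   # avoid the simplified markup
            timeout=15,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed element id; older result pages used 'resultStats'.
        stats = soup.find(id="result-stats")
        return stats.get_text(strip=True) if stats else None

    print(google_result_count("web scraping"))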

How to perform web scraping to get all the reviews of an app in Google Play?

Submitted by 一世执手 on 2021-01-29 13:40:18
Question: I want to be able to get all the reviews that users leave on Google Play about an app. I have the code that was suggested in "Web scrapping in R through Google playstore", but the problem is that it only gets the first 40 reviews. Is there a way to get all of the app's comments?

    # Loading the rvest package
    library(rvest)
    library(magrittr)   # for the '%>%' pipe symbols
    library(RSelenium)  # to get the loaded html of
    # Specifying the url for the desired website to be scraped
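The reviews beyond the first batch are lazy-loaded as the page scrolls, so a static HTML fetch stops at roughly 40; a real browser has to keep scrolling before the rendered HTML is read. The same idea the question's RSelenium code is heading towards, sketched here in Python with Selenium for illustration; the URL parameter and page behaviour change over time, so treat them as assumptions:

    import time
    from selenium import webdriver

    # Placeholder app id; the showAllReviews parameter reflects an older page layout.
    url = "https://play.google.com/store/apps/details?id=com.example.app&showAllReviews=true"

    driver = webdriver.Chrome()
    driver.get(url)

    # Keep scrolling until the page height stops growing, then read the
    # fully rendered HTML instead of the first static batch of reviews.
    last_height = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break
        last_height = height

    html = driver.page_source   # now contains all loaded reviews for parsing
    driver.quit()
    print(len(html), "characters of rendered HTML")

The same scrolling approach works with RSelenium; alternatively, third-party packages that call Google Play's internal review API can page past the first batch.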

Exclude unwanted HTML from Simple HTML DOM - PHP

Submitted by 笑着哭i on 2021-01-29 13:25:26
Question: I am using Simple HTML DOM Parser with PHP to get the title, description and images from a website. The issue I am facing is that I am also getting HTML I don't want, and I don't know how to exclude those tags. Below is the explanation. Here is a sample of the HTML structure being parsed:

    <div id="product_description">
        <p> Some text</p>
        <ul>
            <li>value 1</li>
            <li>value 2</li>
            <li>value 3</li>
        </ul>
        <!-- the div I don't want -->
        <div id="comments">
            <h1> Some Text </h1>
        </div>
    </div>

I am using the PHP script below to
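The usual approach is to remove the unwanted node from the tree first and only then read the parent's text. The same idea, sketched in Python with BeautifulSoup purely to illustrate the pattern (the question itself uses PHP Simple HTML DOM Parser):

    from bs4 import BeautifulSoup

    html = """
    <div id="product_description">
      <p>Some text</p>
      <ul><li>value 1</li><li>value 2</li><li>value 3</li></ul>
      <div id="comments"><h1>Some Text</h1></div>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    description = soup.find(id="product_description")

    # Drop the nested div we don't want before extracting the text.
    unwanted = description.find(id="comments")
    if unwanted:
        unwanted.decompose()

    print(description.get_text(" ", strip=True))   # -> Some text value 1 value 2 value 3

In Simple HTML DOM itself, the commonly suggested equivalent is to blank the unwanted element's outertext before reading the parent element's plaintext.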

Scrape dynamic data using scrapy [closed]

Submitted by 若如初见. on 2021-01-29 13:13:21
Question: I would like to scrape the option chain of a stock from the Nasdaq website using scrapy (along with other data). Nasdaq recently updated their website; here is the URL I am talking about. The data is not loaded with a plain spider, nor in the scrapy shell. From the scrapy docs, I
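When a page is rendered by JavaScript, the table usually arrives as JSON from a separate API call, and the spider can request that endpoint directly instead of the HTML page. A sketch of the pattern in scrapy; the endpoint URL, headers and JSON layout are assumptions to be read off the browser's network tab for the actual option-chain page:

    import json
    import scrapy

    class OptionChainSpider(scrapy.Spider):
        name = "option_chain"

        def start_requests(self):
            # Placeholder endpoint: open the option-chain page with the browser's
            # network tab open and copy the XHR request that returns JSON.
            url = "https://api.nasdaq.com/api/quote/AAPL/option-chain?assetclass=stocks"
            yield scrapy.Request(
                url,
                headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
                callback=self.parse_chain,
            )

        def parse_chain(self, response):
            payload = json.loads(response.text)
            # The JSON layout below is an assumption; log the payload once to see it.
            rows = ((payload.get("data") or {}).get("table") or {}).get("rows") or []
            for row in rows:
                yield row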

Web scraping in VBA - List

Submitted by 不羁的心 on 2021-01-29 12:39:39
Question: I am trying to set up web-scraping VBA code to import data into Excel from this website: https://www.thewindpower.net/windfarms_list_en.php I wish to launch this webpage, select a country and then scrape the data from the table below it (including the URL from the name column). Yet I am stuck on several points: How can I select the country I want in VBA code? How can I select the table when there is no id or class in its tag? How can I import the URL included in the name column? Here is the
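Even without an id or class, a table can be picked out by position or by the text it contains, and the link in the name column is just the href of the <a> inside that cell. The idea, sketched in Python for illustration (the question itself is about VBA); the country form-field name is a guess to confirm by inspecting the page's <form> in the browser's developer tools:

    import requests
    from bs4 import BeautifulSoup

    # 'pays' is an assumed form field name for the country selector; adjust it
    # after checking the actual <form> on the page.
    resp = requests.post(
        "https://www.thewindpower.net/windfarms_list_en.php",
        data={"pays": "France"},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=20,
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    # No id/class on the table: take tables by position, or keep only rows
    # that actually contain data cells.
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if not cells:
                continue
            name_link = cells[0].find("a")                 # URL in the name column
            url = name_link["href"] if name_link else None
            print([c.get_text(strip=True) for c in cells], url)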

Beautiful Soup: extract table data from multiple spans

Submitted by 别等时光非礼了梦想. on 2021-01-29 12:31:03
Question: I am currently working on a class assignment. I have to extract the data from the SPECS table on this webpage: https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/ The data I need is stored as:

    <h2 class="crux-product-title">Specs</h2>
    </div>
    </div>
    <div class="row">
    <div class="col-xs-12">
    <div class="product-model-features-specs-item">
    <div class="row">
    <div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model
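When a "table" is really a grid of divs like this, the usual move is to select each specs item by its class and read the label/value columns inside it. A sketch with requests and BeautifulSoup; only the product-model-features-specs-item class comes from the snippet above, and the inner column structure is an assumption to verify against the page:

    import requests
    from bs4 import BeautifulSoup

    url = ("https://www.consumerreports.org/products/drip-coffee-maker/"
           "behmor-connected-alexa-enabled-temperature-control-396982/overview/")

    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=20)
    soup = BeautifulSoup(resp.text, "html.parser")

    for item in soup.select("div.product-model-features-specs-item"):
        # Each item appears to hold a label column and a value column;
        # print the text of its inner columns so they can be paired up.
        cols = [c.get_text(strip=True) for c in item.select("div[class*='col-']")]
        print(cols)

If the specs turn out to be injected by JavaScript rather than present in the raw HTML, the same selectors can be applied to a Selenium-rendered page_source instead.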

Scraping pricing from a search bar - site link changed

Submitted by 喜夏-厌秋 on 2021-01-29 12:29:16
Question: With the help of some experts here I was able to build a scraper that works fine. The essential lines of code are:

    data = {"partOptionFilter": {"PartNumber": PN.iloc[i, 0], "AlternativeOemId": "17155"}}
    r = requests.post('https://www.partssource.com/catalog/Service', json=data).json()

However, the site recently changed its link from partsfinder.com to partssource.com, and the code no longer seems to work. I am just wondering if there is a trick I can use on my original code to get it working
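When a site migrates like this, the endpoint path and payload shape often change too; the reliable fix is to run a search in the browser with the network tab open, copy the new XHR request, and replay it. A sketch of that replay step; the endpoint and payload below are the ones from the question and may themselves need updating, and the part number is a placeholder:

    import requests

    payload = {"partOptionFilter": {"PartNumber": "some-part-number",   # placeholder
                                    "AlternativeOemId": "17155"}}

    resp = requests.post(
        "https://www.partssource.com/catalog/Service",   # endpoint from the question
        json=payload,
        headers={"User-Agent": "Mozilla/5.0",
                 "Accept": "application/json"},
        timeout=20,
    )

    # Inspect the status and body before calling .json(): a 403/404 or an HTML
    # error page here means the endpoint or required headers have changed.
    print(resp.status_code)
    print(resp.headers.get("Content-Type"))
    print(resp.text[:500])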