web-scraping

Puppeteer can't access HTTPS site with proxy server

Submitted by 女生的网名这么多〃 on 2021-01-29 15:42:59
Question: Here is my Node.js code, in which I am trying to access an HTTPS site through an HTTPS proxy, but it doesn't seem to work; the same setup with an HTTP proxy works fine. I have researched this but nothing has worked.

    const puppeteer = require("puppeteer-extra");
    const useProxy = require("puppeteer-page-proxy");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    const AdblockerPlugin = require("puppeteer-extra-plugin-adblocker");
    puppeteer.use(StealthPlugin());
    puppeteer.use(AdblockerPlugin({ blockTrackers:
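Before changing the Puppeteer setup, it can help to confirm that the proxy itself can tunnel HTTPS (CONNECT) traffic at all. A quick check, sketched here in Python with requests purely for illustration; the proxy URL is a placeholder to replace with the real host, port and credentials:

    import requests

    proxy = "http://user:password@proxy.example.com:8080"   # placeholder proxy address

    try:
        resp = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        print("Proxy handled HTTPS:", resp.status_code, resp.text)
    except requests.RequestException as exc:
        # If this fails while plain HTTP through the same proxy works, the proxy
        # (or its credentials) does not support HTTPS tunnelling, and no browser
        # setting will fix that.
        print("Proxy failed for HTTPS:", exc)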

Getting table values from nowgoal raises an index error

Submitted by 廉价感情. on 2021-01-29 15:37:20
Question: I am quite new to scraping. I am getting links from nowgoal; below is how I started navigating to the page above. I do not wish to get links for all matches; instead I have an input txt file (attached here) and use the selected league and date from it. The following code initialises the input:

    # Initialisation
    league_index = []
    final_list = []
    j = 0

    # Load config
    config = RawConfigParser()
    configFilePath = r'.\config.txt'
    config.read(configFilePath)
    date = config.get('database_config', 'date')
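An IndexError in this kind of loop usually means the code indexes a row or league that is not always present. A minimal, self-contained sketch of a safer pattern; the 'league' config key and the row structure are assumptions made for illustration:

    from configparser import RawConfigParser

    # Example config in the same shape as the question's config.txt;
    # the 'league' key name is assumed.
    config = RawConfigParser()
    config.read_string("""
    [database_config]
    date = 2021-01-29
    league = ENG Premier League
    """)
    date = config.get('database_config', 'date')
    league = config.get('database_config', 'league')

    # rows would normally come from the scraped nowgoal table as (league, link) pairs
    rows = [("ENG Premier League", "/match/123"), ("ESP La Liga", "/match/456")]

    # Filter defensively instead of indexing a position that may not exist,
    # which is the usual cause of the IndexError.
    links = [link for (row_league, link) in rows if row_league == league]
    print(links or f"No matches found for {league} on {date}")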

How to scrape dynamic table data

Submitted by ぐ巨炮叔叔 on 2021-01-29 14:11:04
Question: I want to scrape the table data from http://5000best.com/websites/. The table content is paginated across several pages and loaded dynamically, and I want to scrape it for each category. I can scrape the table manually for each category, but that is not what I want; please have a look and suggest an approach. I am able to build links for each category, e.g. http://5000best.com/websites/Movies/, http://5000best.com/websites/Games/, etc., but I am not sure how to take it further to
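A common pattern for this kind of site is to loop over the category URLs and page numbers, stopping when a page returns no more rows. A sketch with requests and BeautifulSoup; the page-URL pattern and table markup are assumptions that need to be checked against the requests the site actually makes (browser network tab):

    import requests
    from bs4 import BeautifulSoup

    BASE = "http://5000best.com/websites"
    categories = ["Movies", "Games"]            # extend with the other categories

    for category in categories:
        for page in range(1, 51):               # hard cap as a safety net
            # Assumed pagination pattern; copy the real URL the table widget
            # requests for page 2, 3, ... from the browser's network tab.
            url = f"{BASE}/{category}/?page={page}"
            resp = requests.get(url, timeout=15)
            soup = BeautifulSoup(resp.text, "html.parser")
            rows = soup.select("table tr")      # assumed table markup
            if not rows:
                break                           # no more pages for this category
            for row in rows:
                cells = [td.get_text(strip=True) for td in row.find_all("td")]
                print(category, cells)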

Extract the number of results from a Google search

Submitted by 我的未来我决定 on 2021-01-29 13:56:48
Question: I am writing a web scraper to extract the number of results of a Google search, which appears at the top left of the results page. I have written the code below, but I do not understand why phrase_extract is None. I want to extract the phrase "About 12,010,000,000 results". Which part am I getting wrong? Maybe I am parsing the HTML incorrectly?

    import requests
    from bs4 import BeautifulSoup

    def pyGoogleSearch(word):
        address = 'http://www.google.com/#q='
        newword = address + word
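Two things are worth checking here: the fragment form http://www.google.com/#q=... never reaches the server (everything after # stays in the browser), and Google serves stripped-down markup without a browser-like User-Agent. A sketch of the usual fix; the id of the result-count element has changed over the years, so treat the selector as an assumption and inspect the downloaded HTML if it returns None:

    import requests
    from bs4 import BeautifulSoup

    def google_result_count(word):
        # Use the real search endpoint, not the '#q=' fragment form.
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": word},
            headers={"User-Agent": "Mozilla/5.0"},   # avoid the simplified markup
            timeout=15,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        # Assumed element id; older result pages used 'resultStats'.
        stats = soup.find(id="result-stats")
        return stats.get_text(strip=True) if stats else None

    print(google_result_count("web scraping"))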

How to perform web scraping to get all the reviews of an app in Google Play?

Submitted by 一世执手 on 2021-01-29 13:40:18
Question: I want to be able to get all the reviews that users leave on Google Play about an app. I have the code that was suggested in "Web scrapping in R through Google playstore", but the problem is that it only gets the first 40 reviews. Is there a way to get all of the app's comments?

    # Loading the rvest package
    library(rvest)
    library(magrittr)   # for the '%>%' pipe symbols
    library(RSelenium)  # to get the loaded html of
    # Specifying the url for the desired website to be scraped
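The reviews beyond the first batch are lazy-loaded as the page scrolls, so a static HTML fetch stops at roughly 40; a real browser has to keep scrolling before the rendered HTML is read. The same idea the question's RSelenium code is heading towards, sketched here in Python with Selenium for illustration; the URL parameter and page behaviour change over time, so treat them as assumptions:

    import time
    from selenium import webdriver

    # Placeholder app id; the showAllReviews parameter reflects an older page layout.
    url = "https://play.google.com/store/apps/details?id=com.example.app&showAllReviews=true"

    driver = webdriver.Chrome()
    driver.get(url)

    # Keep scrolling until the page height stops growing, then read the
    # fully rendered HTML instead of the first static batch of reviews.
    last_height = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break
        last_height = height

    html = driver.page_source   # now contains all loaded reviews for parsing
    driver.quit()
    print(len(html), "characters of rendered HTML")

The same scrolling approach works with RSelenium; alternatively, third-party packages that call Google Play's internal review API can page past the first batch.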

Exclude unwanted HTML from Simple HTML DOM - PHP

Submitted by 笑着哭i on 2021-01-29 13:25:26
Question: I am using Simple HTML DOM Parser with PHP to get the title, description and images from a website. The issue I am facing is that I am also getting HTML I don't want, and I don't know how to exclude those tags. Below is the explanation. Here is a sample of the HTML structure being parsed:

    <div id="product_description">
        <p> Some text</p>
        <ul>
            <li>value 1</li>
            <li>value 2</li>
            <li>value 3</li>
        </ul>
        <!-- the div I don't want -->
        <div id="comments">
            <h1> Some Text </h1>
        </div>
    </div>

I am using the PHP script below to
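The usual approach is to remove the unwanted node from the tree first and only then read the parent's text. The same idea, sketched in Python with BeautifulSoup purely to illustrate the pattern (the question itself uses PHP Simple HTML DOM Parser):

    from bs4 import BeautifulSoup

    html = """
    <div id="product_description">
      <p>Some text</p>
      <ul><li>value 1</li><li>value 2</li><li>value 3</li></ul>
      <div id="comments"><h1>Some Text</h1></div>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    description = soup.find(id="product_description")

    # Drop the nested div we don't want before extracting the text.
    unwanted = description.find(id="comments")
    if unwanted:
        unwanted.decompose()

    print(description.get_text(" ", strip=True))   # -> Some text value 1 value 2 value 3

In Simple HTML DOM itself, the commonly suggested equivalent is to blank the unwanted element's outertext before reading the parent element's plaintext.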

Scrape dynamic data using scrapy [closed]

Submitted by 若如初见. on 2021-01-29 13:13:21
Question: I would like to scrape the option chain of a stock from the Nasdaq website using scrapy (along with other data). Nasdaq recently updated their website; here is the URL I am talking about. The data is not loaded with a plain spider, nor in the scrapy shell. From the scrapy docs, I
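When a page is rendered by JavaScript, the table usually arrives as JSON from a separate API call, and the spider can request that endpoint directly instead of the HTML page. A sketch of the pattern in scrapy; the endpoint URL, headers and JSON layout are assumptions to be read off the browser's network tab for the actual option-chain page:

    import json
    import scrapy

    class OptionChainSpider(scrapy.Spider):
        name = "option_chain"

        def start_requests(self):
            # Placeholder endpoint: open the option-chain page with the browser's
            # network tab open and copy the XHR request that returns JSON.
            url = "https://api.nasdaq.com/api/quote/AAPL/option-chain?assetclass=stocks"
            yield scrapy.Request(
                url,
                headers={"User-Agent": "Mozilla/5.0", "Accept": "application/json"},
                callback=self.parse_chain,
            )

        def parse_chain(self, response):
            payload = json.loads(response.text)
            # The JSON layout below is an assumption; log the payload once to see it.
            rows = ((payload.get("data") or {}).get("table") or {}).get("rows") or []
            for row in rows:
                yield row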

Web scraping in VBA - List

Submitted by 不羁的心 on 2021-01-29 12:39:39
Question: I am trying to set up web-scraping VBA code to import data into Excel from this website: https://www.thewindpower.net/windfarms_list_en.php I wish to launch this webpage, select a country and then scrape the data from the table below it (including the URL from the name column). Yet I am stuck on several points: How can I select the country I want in VBA code? How can I select the table when there is no id or class in its tag? How can I import the URL included in the name column? Here is the
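Even without an id or class, a table can be picked out by position or by the text it contains, and the link in the name column is just the href of the <a> inside that cell. The idea, sketched in Python for illustration (the question itself is about VBA); the country form-field name is a guess to confirm by inspecting the page's <form> in the browser's developer tools:

    import requests
    from bs4 import BeautifulSoup

    # 'pays' is an assumed form field name for the country selector; adjust it
    # after checking the actual <form> on the page.
    resp = requests.post(
        "https://www.thewindpower.net/windfarms_list_en.php",
        data={"pays": "France"},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=20,
    )
    soup = BeautifulSoup(resp.text, "html.parser")

    # No id/class on the table: take tables by position, or keep only rows
    # that actually contain data cells.
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if not cells:
                continue
            name_link = cells[0].find("a")                 # URL in the name column
            url = name_link["href"] if name_link else None
            print([c.get_text(strip=True) for c in cells], url)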

Beautiful Soup: extract table data from multiple spans

Submitted by 别等时光非礼了梦想. on 2021-01-29 12:31:03
Question: I am currently working on a class assignment. I have to extract the data from the SPECS table on this webpage: https://www.consumerreports.org/products/drip-coffee-maker/behmor-connected-alexa-enabled-temperature-control-396982/overview/ The data I need is stored as:

    <h2 class="crux-product-title">Specs</h2>
    </div>
    </div>
    <div class="row">
    <div class="col-xs-12">
    <div class="product-model-features-specs-item">
    <div class="row">
    <div class='col-lg-6 col-md-6 col-sm-6 col-xs-12 product-model
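When a "table" is really a grid of divs like this, the usual move is to select each specs item by its class and read the label/value columns inside it. A sketch with requests and BeautifulSoup; only the product-model-features-specs-item class comes from the snippet above, and the inner column structure is an assumption to verify against the page:

    import requests
    from bs4 import BeautifulSoup

    url = ("https://www.consumerreports.org/products/drip-coffee-maker/"
           "behmor-connected-alexa-enabled-temperature-control-396982/overview/")

    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=20)
    soup = BeautifulSoup(resp.text, "html.parser")

    for item in soup.select("div.product-model-features-specs-item"):
        # Each item appears to hold a label column and a value column;
        # print the text of its inner columns so they can be paired up.
        cols = [c.get_text(strip=True) for c in item.select("div[class*='col-']")]
        print(cols)

If the specs turn out to be injected by JavaScript rather than present in the raw HTML, the same selectors can be applied to a Selenium-rendered page_source instead.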

Scraping pricing from a search bar - site link changed

Submitted by 喜夏-厌秋 on 2021-01-29 12:29:16
Question: With the help of some experts here I was able to build a scraper that works fine. The essential lines of code are:

    data = {"partOptionFilter": {"PartNumber": PN.iloc[i, 0], "AlternativeOemId": "17155"}}
    r = requests.post('https://www.partssource.com/catalog/Service', json=data).json()

However, the site recently changed its link from partsfinder.com to partssource.com, and the code no longer seems to work. I am just wondering if there is a trick I can use on my original code to get it working
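When a site migrates like this, the endpoint path and payload shape often change too; the reliable fix is to run a search in the browser with the network tab open, copy the new XHR request, and replay it. A sketch of that replay step; the endpoint and payload below are the ones from the question and may themselves need updating, and the part number is a placeholder:

    import requests

    payload = {"partOptionFilter": {"PartNumber": "some-part-number",   # placeholder
                                    "AlternativeOemId": "17155"}}

    resp = requests.post(
        "https://www.partssource.com/catalog/Service",   # endpoint from the question
        json=payload,
        headers={"User-Agent": "Mozilla/5.0",
                 "Accept": "application/json"},
        timeout=20,
    )

    # Inspect the status and body before calling .json(): a 403/404 or an HTML
    # error page here means the endpoint or required headers have changed.
    print(resp.status_code)
    print(resp.headers.get("Content-Type"))
    print(resp.text[:500])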