Puppeteer

Detecting Navigation with Puppeteer

时光总嘲笑我的痴心妄想 posted on 2020-12-06 08:09:12
Question: I'm looking for a "best practice" (besides "don't do that") when using Puppeteer with a page that may, but does not always, reload when a radio button is clicked, a select option is chosen, and so on. The use case: I'm navigating an eCommerce page with options; some of those options cause the page to reload and some don't. I've tried hooking into the on("load") event to catch when this happens and reset my page variable, but I can't get the timing right, and still end up with
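
One common way to handle a click that may or may not reload the page is to start waiting for a navigation before clicking and to treat a timeout as "no reload". The sketch below assumes that approach. The helper name `clickAndMaybeNavigate` and the 3-second default are hypothetical choices, not something from the question, and `page` is expected to be a Puppeteer `Page`.

```javascript
// Sketch: click an element and report whether the click triggered a page load.
// `page` is assumed to be a Puppeteer Page; the helper name and the timeout
// value are illustrative, not part of Puppeteer.
async function clickAndMaybeNavigate(page, selector, timeoutMs = 3000) {
  // Start waiting BEFORE the click so a fast navigation is not missed.
  const navigated = page
    .waitForNavigation({ waitUntil: 'load', timeout: timeoutMs })
    .then(() => true)
    // Any rejection (normally a timeout) is treated as "the page did not reload".
    .catch(() => false);

  await page.click(selector);
  return navigated;
}

// Usage (inside an async function, with a real Puppeteer page):
//   const reloaded = await clickAndMaybeNavigate(page, 'input[type=radio]');
//   if (reloaded) { /* re-query element handles: the old ones are stale */ }
```

The same `page` object stays usable after a navigation in the same tab, so with this pattern it is usually element handles, not the page variable, that need to be fetched again.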

How do facebook ads spy tools scrape the data?

一个人想着一个人 posted on 2020-12-06 07:10:22
Question: How do websites like BigSpy get the latest ads from Facebook? Do you need to scrape all the pages on FB into a database and then use Puppeteer or Selenium to click through the Facebook page UI and look at the ads? Or is it somehow done through the Facebook Ad Library? Could someone share the process by which one could code it up themselves? Thanks. Example website: https://bigspy.com/adspy/facebook/ Answer 1: You can do this with a scraping tool such as Puppeteer, searching for specific items on Facebook or

Down to the fine details: how font-based anti-scraping actually works in Python

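被刻印的时光 ゝ posted on 2020-12-06 04:58:24
This content is excerpted from the forthcoming book 《Python3 反爬虫原理与绕过实战》 (Python 3 Anti-Scraping: Principles and Bypass in Practice). The part of the manuscript being released is Chapter 6: text-obfuscation anti-scraping. This post is Section 4 of Chapter 6; the remaining sections will be released gradually. Overview of font-based anti-scraping: Before CSS3, web developers had to use fonts already installed on the user's computer. With CSS3, developers can use @font-face to specify a font for a web page, which removes the dependence on the fonts installed on the user's computer. Developers can place the font file they want on the web server and reference it in their CSS. When a user visits the web application in a browser, the corresponding font is downloaded to the user's computer. When studying browsers and page rendering, we learned that the role of CSS is to style HTML, so rendering a page does not change the content of the HTML document. Because font loading and glyph mapping are handled by CSS, even tools such as Splash, Selenium, and Puppeteer cannot obtain the corresponding text. Font-based anti-scraping exploits exactly this: a custom font is applied to the important data on a page so that a crawler cannot obtain the correct data. 6.4.1 Font anti-scraping example. Example 7: font anti-scraping example. URL: http://www.porters.vip/confus... . Task: scrape the film rating, number of ratings, and box-office figures from the film information page; the page content is shown in Figure 6-32. Figure 6-32: Example 7 page. Before writing any code, we need to determine how to locate the elements holding the target data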
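
To make the mechanism concrete, the sketch below models the gap between the text in the HTML and the text the user sees. The code points and the mapping are invented for illustration; they are not taken from the book's example site, and a real mapping would have to be recovered from the font file itself.

```javascript
// Sketch: why the DOM text differs from what the user sees under a custom font.
// The code points and the mapping below are made up for illustration.
const glyphMap = new Map([
  ['\uE001', '9'],
  ['\uE002', '.'],
  ['\uE003', '7'],
]);

// Replace every character that the (hypothetical) font redraws; keep the rest.
function visibleText(domText, map) {
  return Array.from(domText, (ch) => (map.has(ch) ? map.get(ch) : ch)).join('');
}

const domText = '\uE001\uE002\uE003'; // what a crawler reads from the HTML
console.log(visibleText(domText, glyphMap)); // what the browser draws on screen
```

A crawler that reads the DOM only ever sees `domText`; the readable value exists only after the font's glyph mapping has been applied.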

The most complete solution! How to properly hide the telltale traits of an automated browser

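谁都会走 posted on 2020-12-04 01:40:20
In the day-before-yesterday's public-account article 《别去送死了。Selenium 与 Puppeteer 能被网站探测的几十个特征》 ("Stop walking into the trap: the dozens of traits by which websites can detect Selenium and Puppeteer"), we noted that almost all of the anti-detection methods found online amount to self-deception: an automated browser exposes dozens of detectable traits, so hiding only the webdriver value accomplishes nothing. Today we look at how to solve this problem properly. We first give the solution, and then explain how I found it. The key to the solution is a single JS file called stealth.min.js. I will explain later how to generate this file. We need to configure Selenium or Pyppeteer to run this JS file before opening any page. For the specific steps and the reasoning behind them, see these two articles of mine: "(Latest version) How to correctly remove window.navigator.webdriver in Selenium" and "(Latest version) How to correctly remove window.navigator.webdriver in Pyppeteer". Here I use Selenium as the example. We write the following code: import time from selenium.webdriver import Chrome from selenium.webdriver.chrome.options import Options chrome_options = Options() chrome_options.add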
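
The article's example uses Selenium. As a rough Puppeteer counterpart, the sketch below registers a script so that it runs in every new document before the page's own scripts. It assumes a `stealth.min.js` file already exists (the sketch does not generate it), that `page` is a Puppeteer `Page`, and the helper name `installInitScript` is hypothetical.

```javascript
// Sketch: make a script run in every new document before the page's own scripts.
// `page` is assumed to be a Puppeteer Page; `scriptSource` is the text of a
// file such as stealth.min.js, which this sketch does not generate.
async function installInitScript(page, scriptSource) {
  if (typeof scriptSource !== 'string' || scriptSource.length === 0) {
    throw new Error('init script source must be a non-empty string');
  }
  // evaluateOnNewDocument registers the script for later navigations of this
  // page, so it has to be called before page.goto().
  await page.evaluateOnNewDocument(scriptSource);
}

// Usage (inside an async function, with a real Puppeteer page):
//   const source = require('fs').readFileSync('stealth.min.js', 'utf8');
//   await installInitScript(page, source);
//   await page.goto('https://example.com');
```

The ordering is the point of the technique: a script injected after navigation runs too late, because the site's detection code may already have read the values it patches.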

Puppeteer doesn't close browser

丶灬走出姿态 posted on 2020-12-02 06:55:30
Question: I'm running Puppeteer on Express/Node/Ubuntu as follows: var puppeteer = require('puppeteer'); var express = require('express'); var router = express.Router(); /* GET home page. */ router.get('/', function(req, res, next) { (async () => { headless = true; const browser = await puppeteer.launch({headless: true, args:['--no-sandbox']}); const page = await browser.newPage(); url = req.query.url; await page.goto(url); let bodyHTML = await page.evaluate(() => document.body.innerHTML); res.send
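
The excerpt is cut off before any answer, so the following is only one plausible cause and remedy: if `page.goto` or `page.evaluate` throws, a `browser.close()` placed at the end of the handler is never reached. A minimal sketch of a guard against that, assuming a hypothetical helper `withBrowser` that takes the launch function as a parameter:

```javascript
// Sketch: always close the browser, even when the work in between throws.
// `launch` is any function returning a browser-like object with close(),
// for example () => puppeteer.launch({ headless: true, args: ['--no-sandbox'] }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs on success and on failure, so no Chromium process is left behind
    // by an exception thrown inside `task`.
    await browser.close();
  }
}

// Usage inside the Express handler (names taken from the question):
//   const html = await withBrowser(
//     () => puppeteer.launch({ headless: true, args: ['--no-sandbox'] }),
//     async (browser) => {
//       const page = await browser.newPage();
//       await page.goto(req.query.url);
//       return page.evaluate(() => document.body.innerHTML);
//     }
//   );
//   res.send(html);
```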

Possible to Get Puppeteer Audio Feed and/or Input Audio Directly to Puppeteer?

霸气de小男生 posted on 2020-12-01 10:42:26
Question: I want to feed a WAV or MP3 file into Puppeteer as a microphone, but in headless mode the application is muted, so I was wondering whether there is a way to get input directly into the browser. I am also wondering whether it's possible to get an audio feed from the browser in headless mode, and/or record the audio and save it to a folder. Answer 1: I ended up using this solution. First, I enabled some options for Chromium: const browser = await puppeteer.launch({ args: [ '--use-fake-ui-for-media-stream'
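
The answer's argument list is cut off after its first flag, so the sketch below is not a reconstruction of it. It only shows how a fake microphone is commonly wired up with Chromium command-line switches. The second and third switches are my assumptions about what such a setup needs, the helper name `fakeMicrophoneArgs` is hypothetical, and as far as I know the file-based capture switch expects a WAV file rather than MP3.

```javascript
// Sketch: build Chromium launch arguments that replace the microphone with a file.
// Only the first switch appears in the quoted answer; the other two are
// assumptions added for illustration.
function fakeMicrophoneArgs(wavPath) {
  if (typeof wavPath !== 'string' || wavPath.length === 0) {
    throw new Error('wavPath must be a non-empty string');
  }
  return [
    '--use-fake-ui-for-media-stream', // from the answer: skip the permission prompt
    '--use-fake-device-for-media-stream', // assumption: use a fake capture device
    `--use-file-for-fake-audio-capture=${wavPath}`, // assumption: feed it from a file
  ];
}

// Usage (inside an async function):
//   const browser = await puppeteer.launch({ args: fakeMicrophoneArgs('/tmp/input.wav') });
```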

How to set max viewport in Puppeteer?

给你一囗甜甜゛ posted on 2020-11-30 04:17:13
Question: When I open a new page, I must specify the size of the viewport using the setViewport function: await page.setViewport({ width: 1920, height: 1080 }) I want to use the maximum viewport. How can I make the viewport resize according to the window size? Answer 1: I may be very late to this, but for others, try: const browser = await puppeteer.launch({defaultViewport: null}); Setting the defaultViewport option to null as above disables the default 800x600 viewport, so the page then takes the size of the browser window. Answer 2: You can
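
With `defaultViewport: null` the page follows the window, so the window itself still needs a size. One way to set it is Chromium's `--window-size` switch; the sketch below combines the two. The helper name `windowSizedLaunchOptions` is hypothetical, and pairing the switch with the answer's option is my assumption rather than part of the quoted answer.

```javascript
// Sketch: launch options that let the viewport follow a window of a given size.
// `defaultViewport: null` comes from the answer; the --window-size switch is an
// added assumption for controlling the window itself.
function windowSizedLaunchOptions(width, height) {
  if (!Number.isInteger(width) || !Number.isInteger(height) || width <= 0 || height <= 0) {
    throw new RangeError('width and height must be positive integers');
  }
  return {
    defaultViewport: null, // do not force the 800x600 default viewport
    args: [`--window-size=${width},${height}`],
  };
}

// Usage (inside an async function):
//   const browser = await puppeteer.launch(windowSizedLaunchOptions(1920, 1080));
```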

How to detect the request come from Puppeteer?

帅比萌擦擦* posted on 2020-11-29 03:10:43
Question: I wonder whether there is some flag or tag that a website can use to detect that a request came from Puppeteer. When I ran my Puppeteer-based code against the target website, the site seemed to know that the request was made by Puppeteer. How does it do that? Answer 1: If you are running Puppeteer yourself and want to pass some information to the website so that it can identify your crawler, the best way is to set a custom user agent: const browser = await puppeteer.launch({ args:
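
The user agent is the simplest signal on both sides: headless Chrome's default user agent contains the token `HeadlessChrome`, and a crawler can announce itself with a token of its own. The sketch below shows a server-side check. The function name `classifyUserAgent` and the `MyCrawler` token are hypothetical, and real sites may rely on many other signals.

```javascript
// Sketch: classify a request by its User-Agent header.
// `crawlerToken` is a made-up marker that a cooperative crawler would add.
function classifyUserAgent(userAgent, crawlerToken = 'MyCrawler') {
  const ua = String(userAgent || '');
  if (ua.includes(crawlerToken)) return 'declared-crawler';
  // Headless Chrome identifies itself with this token unless it is overridden.
  if (ua.includes('HeadlessChrome')) return 'headless-chrome';
  return 'unknown';
}

// Crawler side (inside an async function, with a real Puppeteer page):
//   await page.setUserAgent('Mozilla/5.0 (compatible; MyCrawler/1.0)');
```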