web-scraping

find() on Beautiful Soup in a loop returns TypeError

不打扰是莪最后的温柔 · Submitted on 2021-01-27 18:31:51

Question: I'm trying to scrape a table on an AJAX page with Beautiful Soup and print it out in table form with the texttable library.

```python
import BeautifulSoup
import urllib
import urllib2
import getpass
import cookielib
import texttable

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
...

def show_queue():
    url = 'https://www.animenfo.com/radio/nowplaying.php'
    values = {'ajax': 'true', 'mod': 'queue'}
    data = urllib.urlencode(values)
    f
```
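A likely cause (the full traceback is cut off above) is that find() returns None when nothing matches, and chaining another call onto that result fails. A minimal sketch of the defensive pattern, assuming the modern bs4 package (successor of the old BeautifulSoup module imported above) and using made-up markup:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="song">Song A</td></tr>
  <tr><th>header row, no td</th></tr>
  <tr><td class="song">Song B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for row in soup.find_all("tr"):
    cell = row.find("td", {"class": "song"})
    if cell is None:   # no match: find() returned None, don't chain on it
        continue
    titles.append(cell.get_text())
print(titles)  # ['Song A', 'Song B']
```

The same guard applies inside any loop over find_all(): check each intermediate result before calling methods on it.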

Python - HTTP Error 503: Service Unavailable

半城伤御伤魂 · Submitted on 2021-01-27 18:28:47

Question: I am trying to scrape data from Google and LinkedIn. Somehow it gives me this error:

```
*** httperror_seek_wrapper: HTTP Error 503: Service Unavailable
```

Can someone advise how I can solve this?

Answer 1: Google is simply detecting your query as automated. You would need a captcha solver to get unlimited results. The following links might be helpful:

https://support.google.com/websearch/answer/86640?hl=en

Bypassing a captcha using an OCR engine: http://www.debasish.in/2012/01/bypass-captcha-using-python
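For transient 503s (as opposed to a hard bot block, which only official APIs such as Google Custom Search solve properly), the standard client-side mitigation is exponential backoff with jitter between retries. A minimal sketch; the retry count and delay bounds are arbitrary illustrative choices:

```python
import random

def backoff_delays(retries=5, base=1.0, cap=60.0):
    """Yield a sleep time for each retry: exponential growth, capped,
    with jitter so many clients don't retry in lockstep."""
    for attempt in range(retries):
        yield min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.0)

delays = list(backoff_delays())
```

A fetch loop would sleep for each yielded delay after a 503 response and give up once the generator is exhausted.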

Scrapy Pipeline doesn't insert into MySQL

巧了我就是萌 · Submitted on 2021-01-27 17:55:39

Question: I'm trying to build a small app for a university project with Scrapy. The spider scrapes the items, but my pipeline does not insert the data into the MySQL database. To test whether the pipeline or my pymysql usage is at fault, I wrote a test script:

```python
#!/usr/bin/python3
import pymysql

str1 = "hey"
str2 = "there"
str3 = "little"
str4 = "script"

db = pymysql.connect("localhost", "root", "**********", "stromtarife")
cursor = db.cursor()
cursor.execute(
```
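The excerpt is cut off before the INSERT, but the most common reason a pipeline "silently" fails to insert is a missing db.commit() after cursor.execute(). A sketch of the pattern using sqlite3 from the standard library in place of pymysql (the table and column names are made up; the commit rule is the same for both drivers):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tarife (anbieter TEXT, preis TEXT)")

cursor = db.cursor()
cursor.execute("INSERT INTO tarife VALUES (?, ?)", ("hey", "there"))
db.commit()  # pymysql has autocommit off by default; without this the INSERT is discarded

rows = cursor.execute("SELECT * FROM tarife").fetchall()
print(rows)  # [('hey', 'there')]
```

In a Scrapy pipeline, the commit belongs in process_item() (or batched in close_spider()).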

Clicking links with Python BeautifulSoup

六眼飞鱼酱① · Submitted on 2021-01-27 17:47:44

Question: I'm new to Python (I come from a PHP/JavaScript background), but I wanted to write a quick script that crawls a website and all of its child pages, finds all a tags with href attributes, counts them, and then "clicks" each link. I can count all of the links, but I can't figure out how to "click" them and return the response codes.

```python
from bs4 import BeautifulSoup
import urllib2
import re

def getLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page,
```
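"Clicking" a link in a script just means issuing another GET request to its href and reading the status code (in Python 3, urllib.request.urlopen(url).status). The collection half can be done with the standard library alone; the markup below is invented for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<a href="/a">one</a><a name="x">no href</a><a href="/b">two</a>'
collector = LinkCollector()
collector.feed(page)
print(len(collector.links), collector.links)  # 2 ['/a', '/b']
```

Relative hrefs like these need to be joined to the base URL (urllib.parse.urljoin) before they can be fetched.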

Excel VBA - Web Scraping - Get value in HTML Table cell

荒凉一梦 · Submitted on 2021-01-27 16:45:25

Question: I am trying to create a macro that scrapes a cargo tracking website, and I have to create four such macros because each airline has a different website. I am new to VBA and web scraping. I put together code that works for one website, but when I tried to replicate it for another, I got stuck in the loop. I think it may be how I am referring to the element, but like I said, I am new to VBA and have no clue about HTML. I am trying to get the "notified" value in the highlighted line from the
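When a cell has no usable id, the usual approach is to walk the table rows, match the label cell, and read its neighbour. Sketched here in Python with a guessed table layout (the real page's HTML is not shown); the VBA equivalent loops over the document's getElementsByTagName("td") collection with the same comparison:

```python
import xml.etree.ElementTree as ET

# Guessed stand-in for the airline's status table.
table = ET.fromstring("""<table>
  <tr><td>Pieces</td><td>3</td></tr>
  <tr><td>Status</td><td>Notified</td></tr>
</table>""")

value = None
for row in table.findall(".//tr"):
    cells = row.findall("td")
    if len(cells) == 2 and cells[0].text == "Status":
        value = cells[1].text  # the cell next to the label
print(value)  # Notified
```

Matching on the label text makes the macro survive layout shuffles better than a hard-coded row index does.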

How can I loop over pages and get data from every page with selenium?

a 夏天 · Submitted on 2021-01-27 14:32:15

Question: I want to do a Google search and collect the links to all hits so that I can click those links and extract data from them afterwards. How can I get the link from every hit? I've tried several solutions, such as a for loop and a while True statement; I'll show some examples of the code below. I either get no data at all, or I get links from only one page. Can someone please help me figure out how to iterate over every page of the Google search and get all the links so
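A robust pattern with Selenium is to collect all hrefs on the current page first, then move to the next results page, and repeat until there is no "Next" control; navigating away while still iterating over the page's elements causes stale-element errors, which is why such loops often stop after one page. The skeleton, with the browser calls stubbed out by plain functions so only the control flow is shown:

```python
def collect_all_links(get_page_links, get_next_page, start):
    """Accumulate links page by page until get_next_page returns None.
    With Selenium, get_page_links would read each result anchor's href
    attribute and get_next_page would follow the 'Next' button."""
    links, page = [], start
    while page is not None:
        links.extend(get_page_links(page))
        page = get_next_page(page)
    return links

# Toy three-page "search result":
pages = {1: (["a", "b"], 2), 2: (["c"], 3), 3: (["d"], None)}
all_links = collect_all_links(lambda p: pages[p][0], lambda p: pages[p][1], 1)
print(all_links)  # ['a', 'b', 'c', 'd']
```

Only after the full list is collected does the script visit each link to extract data.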

How to click a button on a website using Puppeteer without any class, id, etc. assigned to it?

对着背影说爱祢 · Submitted on 2021-01-27 14:31:03

Question: I want to click a button on a website. The button has no id, class, etc., so I need a way to click it by the name that's on it. In this example I need to click by the name "Supreme®/The North Face® Leather Shoulder Bag". This is my code in Node.js:

```javascript
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('https://www.supremenewyork.com/shop/all/bags');
```
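When an element has no id or class, selecting it by its visible text is the usual answer; in Puppeteer that is an XPath query such as page.$x('//a[text()="..."]') followed by a click on the returned handle. The same text-matching idea, sketched in Python on simplified stand-in markup (the real Supreme page differs):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in markup; the real product page is more complex.
html = """<div>
  <a href="/shop/bags/1">Supreme/The North Face Leather Shoulder Bag</a>
  <a href="/shop/bags/2">Other Bag</a>
</div>"""
root = ET.fromstring(html)
# Select the anchor purely by its visible text, then read where it leads.
target = next(a for a in root.iter("a")
              if a.text == "Supreme/The North Face Leather Shoulder Bag")
print(target.get("href"))  # /shop/bags/1
```

Note that text matching must reproduce the visible string exactly, including special characters such as the ® marks in the real listing.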

VBA - web scraping using ng-click

故事扮演 · Submitted on 2021-01-27 14:23:25

Question: I am using Selenium and I would like to be able to click on the following:

```html
<a ng-click="download()">download</a>
```

This is an a tag, and I am not sure what the code to click an a tag that has ng-click in it would look like.

```vba
Dim d As WebDriver
Set d = New ChromeDriver
Const URL = "url of the website - not public"

With d
    .Start "Chrome"
    .get URL
    .Window.Maximize
    .FindElementById("Search").SendKeys "information to search"
    .Wait 1000
    .FindElementById("Submit").Click
    .Wait 1000
    'then I need to
```
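With no id or class, the ng-click attribute itself (or the link text) can serve as the locator; in SeleniumBasic that would be along the lines of .FindElementByXPath("//a[@ng-click='download()']").Click, though the exact call depends on the wrapper version. The attribute-matching idea, sketched in Python on the snippet from the question:

```python
import xml.etree.ElementTree as ET

html = '<div><a ng-click="download()">download</a><a href="#">other</a></div>'
root = ET.fromstring(html)
# Match the anchor by its ng-click attribute, as an XPath locator would.
link = root.find(".//a[@ng-click='download()']")
print(link.text)  # download
```

If several elements share the same ng-click value, the link text can be added to the predicate to disambiguate.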

Requests.get shows different HTML than Chrome's Developer Tools

我是研究僧i · Submitted on 2021-01-27 13:15:38

Question: I am working on a web scraping tool in Python (specifically a Jupyter notebook) that scrapes a few real estate pages and saves data like price, address, etc. It works just fine for one of the pages I picked, but when I try to scrape this page: sreality.cz (sorry, the page is in Czech, but the actual content is not that important now) using requests.get(), I get this result:

```html
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset=
```
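The {{ html.lang }} binding and ng-app attribute show that requests is receiving the raw Angular template: the page is filled in by JavaScript after load, which requests never runs, while Chrome's developer tools show the DOM after rendering. The fixes are a real browser driver (Selenium, Playwright) or, often better, the JSON endpoint the page itself calls, visible in the devtools Network tab. A tiny heuristic for spotting such pages, run on invented sample strings:

```python
def looks_js_rendered(html: str) -> bool:
    """Un-rendered Angular templates leave '{{ ... }}' bindings and
    ng-* attributes in the HTML that requests receives."""
    return "{{" in html or "ng-app" in html

raw = '<html lang="{{ html.lang }}" ng-app="sreality"></html>'
rendered = '<html lang="cs"><head><title>Listing</title></head></html>'
print(looks_js_rendered(raw), looks_js_rendered(rendered))  # True False
```

The check is only a heuristic (other frameworks leave different fingerprints), but it quickly separates server-rendered pages from client-rendered ones.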

Why does parsing an XML document work with MSXML v3.0 but not with MSXML v6.0?

徘徊边缘 · Submitted on 2021-01-27 12:51:09

Question: I am working on a project that scrapes and collects data from many different sources around the internet, with a different method for each source depending on its characteristics. The most recent addition is a web API call which returns the following XML as a response:

```xml
<?xml version="1.0"?>
<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0">
  <mRID>29b526a69b9445a7bb507ba446e3e8f9</mRID>
  <revisionNumber>1</revisionNumber>
  <type>A44</type>
  <sender
```
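Assuming the failure is in the XPath queries rather than in loading the document, the likely culprit is the default namespace on the root element: MSXML 6.0 requires binding a prefix via setProperty "SelectionNamespaces" and using it in every step (e.g. //doc:mRID), whereas MSXML 3.0's default selection language is laxer about unprefixed names. The same behaviour, reproduced with Python's ElementTree on the response above:

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0"?>
<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0">
  <mRID>29b526a69b9445a7bb507ba446e3e8f9</mRID>
  <revisionNumber>1</revisionNumber>
</Publication_MarketDocument>"""

root = ET.fromstring(xml)
unqualified = root.find("mRID")  # None: every element is in the default namespace
ns = {"doc": "urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:0"}
qualified = root.find("doc:mRID", ns)
print(unqualified, qualified.text)
```

The prefix name ("doc" here) is arbitrary; what matters is that it maps to the exact namespace URI declared on the root element.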