web-scraping | 易学教程

Scrapy simulate XHR request - returning 400

Question: I'm trying to get data from a site that uses Ajax. The page loads and then JavaScript requests the content. See this page for details: https://www.tele2.no/mobiltelefon.aspx The problem is that when I try to simulate this process by calling this URL: https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters I get a 400 response telling me that the request is not allowed. This is my code:

    # -*- coding: utf-8 -*-
    import scrapy
    import json

    class Tele2Spider(scrapy.Spider):
        name =
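
One thing worth checking with a 400 like this is whether the simulated request has the same shape as the browser's XHR: method, JSON body, and headers. The sketch below assembles such a request with the standard library only and does not send it. The POST method, the header set, and the payload key `filters` are assumptions for illustration; the real values should be copied from the browser's network tab.

```python
import json
import urllib.request

URL = "https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters"
PAGE = "https://www.tele2.no/mobiltelefon.aspx"


def build_xhr_request(url, payload, referer):
    """Build (but do not send) a POST request shaped like a browser XHR."""
    # Serialize the payload as JSON, as a script on the page would.
    body = json.dumps(payload).encode("utf-8")
    headers = {
        # Assumed headers: compare against what the browser actually sends.
        "Content-Type": "application/json; charset=utf-8",
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical payload; the real filter structure is not shown in the question.
request = build_xhr_request(URL, {"filters": []}, PAGE)
```

In a spider, the same pieces map onto `scrapy.Request(url, method="POST", body=json.dumps(payload), headers=headers, callback=...)`.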

Google scraping using Python requests: How to avoid being blocked due to many requests?

Question: For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but around the 80th company I get blocked by Google. This is the message I get:

> Our systems have detected unusual traffic from your computer network.
> This page checks to see if it's really you sending the requests, and
> not a robot.

I tried two different ways to get my data: A

Get web page content (Not from source code) [duplicate]

Question: This question already has answers here: Web-scraping JavaScript page with Python (15 answers). Closed 4 years ago. I want to get the rainfall data for each day from here. In inspect mode I can see the data, but when I view the page source I cannot find it. I am using urllib2 and BeautifulSoup from bs4. Here is my code:

    import urllib2
    from bs4 import BeautifulSoup

    link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
    r = urllib2.urlopen(link)
    soup = BeautifulSoup(r)

Best way to download all images from a site using Java? Currently getting a 403 Status Error

Question: I am trying to download all the images from a site, but I'm not sure this is the best way; I have tried setting a user agent and referrer to no avail. The 403 status error only occurs when trying to download the images from the src page, while the page that has all the images in one place doesn't show any errors and provides the src for the images. Is there a way to download the images without visiting the src page, or a better way to do this entirely? Here is my code

Why is BeautifulSoup's findAll returning an empty list when I search by class?

Question: I am trying to web-scrape using an h2 tag, but BeautifulSoup returns an empty list.

    <h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">

    html = urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
    bs0bj = BeautifulSoup(html, "lxml")
    nameList = bs0bj.findAll("h2", {"class": "iCIMS_InfoMsg iCIMS_InfoField_Job"})
    print(nameList)

Answer 1: The content is inside an iframe and updated via JS (so it is not present in the response to the initial request). You can use the same link the page is using
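
The iframe point in the answer can be sketched as follows: before searching for the `h2`, look for `iframe` elements in the outer page and resolve their `src` values, since the job content would then live at that second URL. This is a minimal standard-library sketch, not the answerer's code; the sample markup, the iframe `src`, and the `example.com` base URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collect absolute URLs of every <iframe src="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the outer page URL.
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_sources(html, base_url):
    collector = IframeSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


# Hypothetical outer page: the visible content sits behind the iframe.
SAMPLE_HTML = (
    "<html><body><h1>Careers</h1>"
    '<iframe src="/embedded/job-details"></iframe>'
    "</body></html>"
)
iframe_urls = find_iframe_sources(SAMPLE_HTML, "https://example.com/jobs/2034")
```

Each URL in `iframe_urls` would then be fetched and parsed on its own; if the inner document is also filled in by JavaScript, a plain HTTP fetch will still come back without the `h2`.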

How to iterate over divs in Scrapy?

Question: This is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem, but I just can't see what is wrong with this code. My goal is to scrape all of the opera shows from a given website. The data for every show is inside one div with class "row-fluid row-performance ". I am trying to iterate over them to retrieve it, but it doesn't work: each iteration gives me the content of the first div (I get the same show 19 times instead of different items). Thanks
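
A common cause of this symptom (an assumption here, since the spider code is not shown) is that the query inside the loop runs against the whole document instead of the current div. In Scrapy, an XPath that starts with `//` on a nested selector still searches from the document root, while `.//` stays relative to that selector. The sketch below reproduces the effect with the standard library's ElementTree rather than Scrapy so it runs without dependencies; the sample markup and show titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with three show rows.
SAMPLE = """
<div>
  <div class="row-fluid row-performance "><h2>Show A</h2></div>
  <div class="row-fluid row-performance "><h2>Show B</h2></div>
  <div class="row-fluid row-performance "><h2>Show C</h2></div>
</div>
"""

root = ET.fromstring(SAMPLE)
rows = root.findall(".//div[@class='row-fluid row-performance ']")


def titles_document_wide(document, row_elements):
    # Bug pattern: every pass queries the whole document, so the first
    # match in the document is returned each time.
    return [document.find(".//h2").text for _ in row_elements]


def titles_relative(row_elements):
    # Fix: query relative to the row currently being iterated.
    return [row.find(".//h2").text for row in row_elements]


repeated = titles_document_wide(root, rows)
distinct = titles_relative(rows)
```

Here `repeated` holds the first title three times, while `distinct` holds one title per row. In a Scrapy loop the relative form would look like `row.xpath(".//h2/text()").get()`.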