web-scraping

Scrapy simulate XHR request - returning 400

☆樱花仙子☆ submitted on 2021-02-08 06:59:22
Question: I'm trying to get data from a site that uses Ajax. The page loads and then JavaScript requests the content. See this page for details: https://www.tele2.no/mobiltelefon.aspx The problem is that when I try to simulate this process by calling this URL: https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters I get a 400 response telling me that the request is not allowed. This is my code: # -*- coding: utf-8 -*- import scrapy import json class Tele2Spider(scrapy.Spider): name =

Google scraping using Python - requests: How to avoid being blocked due to many requests?

偶尔善良 submitted on 2021-02-08 06:37:33
Question: For a school project I need to get the web addresses of 200 companies (based on a list). My script works fine, but at around company 80 I get blocked by Google. This is the message that I'm getting: "Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block' I tried two different ways to get my data: A

Get web page content (Not from source code) [duplicate]

痞子三分冷 submitted on 2021-02-08 03:55:22
Question: This question already has answers here: Web-scraping JavaScript page with Python (15 answers). Closed 4 years ago. I want to get the rainfall data for each day from here. When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it. I am using urllib2 and BeautifulSoup from bs4. Here is my code: import urllib2 from bs4 import BeautifulSoup link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1" r = urllib2.urlopen(link) soup = BeautifulSoup(r)

Best way to download all images from a site using Java? Currently getting a 403 Status Error

空扰寡人 submitted on 2021-02-08 03:42:25
Question: I am trying to download all the images from a site, but I'm not sure if this is the best way, as I have tried setting a user agent and referrer to no avail. The 403 Status Error only occurs when trying to download the images from the src page, while the page that has all the images in one place doesn't show any errors and sends the src to the images. I am not sure whether there is a way to download the images without visiting the src page, or a better way to do this entirely. Here is my code

Why is BeautifulSoup's findAll returning an empty list when I search by class?

元气小坏坏 submitted on 2021-02-08 03:16:13
Question: I am trying to web-scrape using an h2 tag, but BeautifulSoup returns an empty list. <h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job"> html=urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job") bs0bj=BeautifulSoup(html,"lxml") nameList=bs0bj.findAll("h2",{"class":"iCIMS_InfoMsg iCIMS_InfoField_Job"}) print(nameList) Answer 1: The content is inside an iframe and updated via JavaScript (so it is not present in the initial request). You can use the same link the page is using
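
The answer's point about the iframe can be illustrated with a minimal sketch: find the `src` of each `<iframe>` in the outer page so that the framed document can be requested directly. This is only an illustration under stated assumptions. It uses the standard library's `html.parser` instead of BeautifulSoup, and the sample HTML, the `example.com` URL, the `in_iframe=1` query parameter, and the names `IframeSrcParser` and `find_iframe_sources` are all hypothetical, not taken from the original question or answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the outer page URL.
                self.sources.append(urljoin(self.base_url, src))


def find_iframe_sources(html, base_url):
    """Return absolute URLs for all iframes found in the given HTML."""
    parser = IframeSrcParser(base_url)
    parser.feed(html)
    return parser.sources


# Hypothetical outer page: the wanted <h2> is not here, only the iframe is.
SAMPLE_HTML = """
<html><body>
  <h1>Careers</h1>
  <iframe id="content_frame" src="/jobs/2034/job?in_iframe=1"></iframe>
</body></html>
"""

sources = find_iframe_sources(SAMPLE_HTML, "https://example.com/jobs/2034/job")
print(sources)
```

Each URL in `sources` could then be fetched and parsed in place of the outer page. If the framed document itself fills in its content with JavaScript, this would still not be enough and a browser-driven approach would be needed.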

How to iterate over divs in Scrapy?

点点圈 submitted on 2021-02-07 20:57:33
Question: This is probably a very trivial question, but I am new to Scrapy. I've tried to find a solution to my problem, but I just can't see what is wrong with this code. My goal is to scrape all of the opera shows from a given website. The data for every show is inside one div with class "row-fluid row-performance ". I am trying to iterate over them to retrieve it, but it doesn't work. It gives me the content of the first div in each iteration (I get the same show 19 times instead of different items). Thanks