screen-scraping

Can't extract the text and find all by BeautifulSoup

末鹿安然 submitted on 2020-01-14 04:37:18

Question: I want to extract all the available items in the équipements section, but I can only get the first four items, and then I get '+ plus'.

    import urllib2
    from bs4 import BeautifulSoup
    import re
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    url = 'https://www.airbnb.fr/rooms/8261637?s=bAMrFL5A'
    req = urllib2.Request(url=url, headers=headers)
    html = urllib2.urlopen(req)
    bsobj = BeautifulSoup(html.read(), 'lxml')
    b
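The '+ plus' entry suggests the remaining amenities are revealed by JavaScript rather than present in the HTML that urllib2 receives, so a plain fetch may never return them; a browser driver such as Selenium, or the site's underlying JSON endpoint, is the usual workaround. The find_all pattern itself looks like the sketch below, run against a made-up static snippet (the class name "amenity" is an assumption for illustration, not Airbnb's real markup):

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the fetched page; the real Airbnb
# markup differs and its hidden items may only exist after JS runs.
html = """
<div class="amenities">
  <span class="amenity">Wifi</span>
  <span class="amenity">Kitchen</span>
  <span class="amenity">Heating</span>
  <span class="amenity">Washer</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag, not just the first one
items = [span.get_text() for span in soup.find_all("span", class_="amenity")]
print(items)
```

If find_all on the real page still stops at four items, the rest are simply not in the downloaded HTML, and no parser change will recover them.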

I want to scrape a site using GAE and post the results into a Google Entity

回眸只為那壹抹淺笑 submitted on 2020-01-14 03:28:06

Question: I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc then go into each of the links, extract various pieces of information (e.g. permissions, prims, etc.), and post the results into an Entity on Google App Engine. I want to know the best way to go
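Before anything touches the Datastore, each detail page has to be reduced to a dict of fields; on App Engine that dict's keys would then map onto the properties of an ndb.Model (or the older db.Model) and be saved with put(). A sketch of the extraction step against an invented snippet (the markup and field labels are assumptions, not xstreetsl.com's real HTML):

```python
import re

# Invented detail-page fragment; the real site's markup will differ.
detail_html = """
<div class="detail">
  <b>Permissions:</b> Copy/Modify<br>
  <b>Prims:</b> 12<br>
</div>
"""

def extract_fields(html):
    """Pull the labelled fields out of a detail page (illustrative only)."""
    perms = re.search(r"Permissions:</b>\s*([^<]+)", html).group(1).strip()
    prims = int(re.search(r"Prims:</b>\s*(\d+)", html).group(1))
    return {"permissions": perms, "prims": prims}

entity = extract_fields(detail_html)
print(entity)
```

On App Engine the last step would be something like `Listing(**entity).put()` against a model with matching properties; fetching many detail pages is best done with the task queue or async URL Fetch so a single request doesn't hit the deadline.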

Sending form data to aspx page

孤者浪人 submitted on 2020-01-13 09:44:32

Question: There is a need to do a search on the website

    url = r'http://www.cpso.on.ca/docsearch/'

This is an ASPX page (I'm beginning this trek as of yesterday, sorry for the noob questions). Using BeautifulSoup, I can get the __VIEWSTATE and __EVENTVALIDATION like this:

    viewstate = soup.find('input', {'id': '__VIEWSTATE'})['value']
    eventval = soup.find('input', {'id': '__EVENTVALIDATION'})['value']

and the headers can be set like this:

    headers = {'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1
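Those two hidden fields (and often __EVENTTARGET and __EVENTARGUMENT as well) must be echoed back verbatim in the POST body together with the visible search controls, or the server rejects the postback. A sketch of assembling that payload, using made-up placeholder values in place of the long base64 blobs a real ASP.NET page carries:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by the first GET of the docsearch page;
# the values below are fabricated placeholders.
page = """
<form>
  <input id="__VIEWSTATE" name="__VIEWSTATE" value="dDwtMTA">
  <input id="__EVENTVALIDATION" name="__EVENTVALIDATION" value="/wEWBA">
</form>
"""

soup = BeautifulSoup(page, "html.parser")
payload = {
    "__VIEWSTATE": soup.find("input", {"id": "__VIEWSTATE"})["value"],
    "__EVENTVALIDATION": soup.find("input", {"id": "__EVENTVALIDATION"})["value"],
    # plus the visible search fields, whose name= attributes must be
    # read from the same form -- they are not guessed here
}
print(payload)
```

The payload would then be POSTed back to the same URL in the same session (with requests, `requests.Session().post(url, data=payload)`), so that cookies set on the first GET are carried along.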

How to protect/monitor your site from crawling by malicious user

三世轮回 submitted on 2020-01-12 04:06:12

Question: The situation: a site with content protected by username/password (not all of it under our control, since there can be trial/test users). A normal search engine can't get at the content because of the username/password restrictions, but a malicious user can still log in and pass the session cookie to a "wget -r" or something similar. The question is what the best solution is to monitor such activity and respond to it (given that the site policy is that no crawling/scraping is allowed). I can think of some options: Set up some traffic
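One common building block for the monitoring side is a per-session sliding-window request counter: a logged-in human rarely fetches dozens of pages per minute, while "wget -r" does. A minimal sketch (the limit and window values are arbitrary and would need tuning against real traffic):

```python
import time
from collections import defaultdict, deque

class RateMonitor:
    """Flag sessions exceeding `limit` requests in a sliding `window` seconds."""

    def __init__(self, limit=30, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # session_id -> request timestamps

    def record(self, session_id, now=None):
        """Record one request; return True if the session now looks like a crawler."""
        now = time.time() if now is None else now
        q = self.hits[session_id]
        q.append(now)
        # Drop timestamps that have slid out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q) > self.limit

# Simulate one session making 8 requests in 8 seconds against a 5-per-10s limit.
monitor = RateMonitor(limit=5, window=10.0)
flagged = [monitor.record("sess-1", now=t) for t in range(8)]
print(flagged)
```

A flagged session can then be throttled, CAPTCHA-challenged, or logged out; a determined scraper can pace itself below any threshold, so this detects the lazy case, not the careful one.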

Web scraping, screen scraping, data mining tips? [closed]

ⅰ亾dé卋堺 submitted on 2020-01-11 19:50:27

Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 7 years ago. I'm working on a project and I need to do a lot of screen scraping to get a lot of data as fast as possible. I'm wondering if anyone

Scraping and parsing Google search results using Python

十年热恋 submitted on 2020-01-11 15:21:11

Question: I asked a question about realizing a general idea to crawl and save webpages. Part of the original question was how to crawl and save a lot of "About" pages from the Internet. With some further research, I found some options to go ahead with, for both scraping and parsing (listed at the bottom). Today, I ran into a Ruby discussion about how to scrape Google search results. This provides a great alternative for my problem, which saves all the effort on the crawling part. The new
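Whatever fetches the results page, the parsing step is the same: select each result block and pull out the anchor text and href. A sketch against a simplified stand-in page (Google's real markup changes frequently and scraping it may violate the terms of service, so the selector here is an assumption; the official Custom Search API is the supported route):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a search results page; real SERP markup differs.
serp = """
<div class="result"><a href="https://example.com/about">About Example</a></div>
<div class="result"><a href="https://example.org/about">About Example Org</a></div>
"""

soup = BeautifulSoup(serp, "html.parser")
# One (title, url) pair per result block.
results = [(a.get_text(), a["href"]) for a in soup.select("div.result a")]
print(results)
```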

Unable to load ASP.NET page using Python urllib2

三世轮回 submitted on 2020-01-11 14:32:10

Question: I am trying to do a POST request to https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/WellDetails/WellDetails.aspx in order to scrape data. Here is my current code:

    from urllib import urlencode
    import urllib2

    # Configuration
    uri = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/WellDetails/WellDetails.aspx'
    headers = {
        'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
        'HTTP_ACCEPT':
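Two details commonly trip this up: the header names should be the real HTTP names ('User-Agent', 'Accept'), not the CGI-style 'HTTP_USER_AGENT' variants, and because the target is a WebForms page the POST body must echo back its __VIEWSTATE and actual control names. A Python 3 sketch of building the request (the field values and the control name below are placeholders, not the page's real ones):

```python
from urllib.parse import urlencode
from urllib.request import Request

uri = ('https://www.paoilandgasreporting.state.pa.us/publicreports/'
       'Modules/WellDetails/WellDetails.aspx')

# Placeholder fields: a real WebForms POST echoes the page's own
# __VIEWSTATE blob and control names scraped from the form on a prior GET.
fields = {'__VIEWSTATE': 'dDwtMTA', 'WellPermitNumber': '123-45678'}
body = urlencode(fields).encode('ascii')

# Passing data= makes this a POST; note the corrected header name.
req = Request(uri, data=body, headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_method())
```

Under Python 2, the equivalent is `urllib2.Request(uri, data=urlencode(fields), headers=...)`; the body-means-POST behaviour is the same.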

Issue scraping page with “Load more” button with rvest

∥☆過路亽.° submitted on 2020-01-10 02:09:32

Question: I want to obtain the links to the ATMs listed on this page: https://coinatmradar.com/city/345/bitcoin-atm-birmingham-uk/ Would I need to do something about the 'Load more' button at the bottom of the page? I have been using the selector tool you can download for Chrome to pick the CSS path. I've written the code block below, and it only seems to retrieve the first ten links.

    library(rvest)
    base <- "https://coinatmradar.com/city/345/bitcoin-atm-birmingham-uk/"
    base_read <- read_html(base)
    atm
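rvest only parses the HTML delivered on the first request, so anything behind the 'Load more' button never reaches it; that button typically fires an AJAX call for the next batch of results, which can be replicated directly once the endpoint is spotted in the browser's network tab. The paging loop itself is language-independent; here is a sketch with a fake fetcher standing in for that hypothetical endpoint (the real site's paging URL and batch size are unknown here):

```python
def fetch_all(fetch_page):
    """Keep requesting pages until one comes back empty.

    `fetch_page(n)` stands in for the site's real paging endpoint,
    discovered in the browser's network tab; it is hypothetical here.
    """
    links, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return links
        links.extend(batch)
        page += 1

# Fake endpoint: 25 links served 10 at a time, mirroring the
# first-ten-links symptom described above.
data = [f"/bitcoin_atm/{i}/" for i in range(25)]
result = fetch_all(lambda n: data[(n - 1) * 10 : n * 10])
print(len(result))
```

The same loop translates directly to R, replacing the lambda with a function that calls read_html on the paged URL; if no such endpoint exists and the button runs pure JavaScript, a headless browser (RSelenium) is the fallback.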

VBA to login in zerodha account and then download and again upload live data for buy and sell signal

北城以北 submitted on 2020-01-06 08:11:15

Question: I want my VBA to:

1. Log in to my Zerodha Kite account.
2. Integrate the API.
3. Download the live data from there and, after analysing it, upload data for a buy or sell decision.

I tried to log on to my Zerodha account, but it just refuses my request, so I can't do anything.

    Sub Test()
        Set ie = CreateObject("InternetExplorer.Application")
        ie.Visible = True
        ie.Navigate ("https://kite.zerodha.com/" & ActiveCell)  ' the original had a doubled "https://https://", which alone makes navigation fail
        Do
            If ie.ReadyState = 4 Then
                ie.Visible = False
                Exit Do
            Else
                DoEvents