scrape

Scrape a website (JavaScript website) using PHP

你离开我真会死。 Posted on 2019-12-06 09:54:53
I am trying to scrape a website (I believe it uses JavaScript) using a simple PHP script. I am a beginner, so any help would be greatly appreciated. The URL of the webpage is: http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet/Yes-Bank-Ltd/532648 For example, I would like to pass the company name (Yes-Bank-Ltd) and code (532648) to file_get_contents. I am not sure how to do it, so can somebody please help? Thanks, Nidhi Why not just append the company name and code to the URL? Here is an idea: fill up an array of companies and codes (need to …
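The answer's idea (interpolating the company slug and code into the URL before fetching the page) can be sketched as follows, in Python rather than PHP for brevity; the URL pattern is taken directly from the question:

```python
def balance_sheet_url(company, code):
    # Build the Balance-Sheet URL from the company slug and code,
    # following the pattern shown in the question.
    base = "http://www.indiainfoline.com/Markets/Company/Fundamentals/Balance-Sheet"
    return "%s/%s/%s" % (base, company, code)

# The page body could then be fetched with
# urllib.request.urlopen(balance_sheet_url("Yes-Bank-Ltd", "532648")).read()
# (in PHP, the equivalent is file_get_contents() on the built URL).
```

Note that if the figures are rendered by JavaScript, fetching the raw HTML this way will not contain them; the data would have to come from the underlying request the page makes.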

Scraping an XML/JavaScript table with R [closed]

半城伤御伤魂 Posted on 2019-12-06 09:52:12
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 5 years ago. I want to scrape a table like this: http://www.oddsportal.com//hockey/usa/nhl/carolina-hurricanes-ottawa-senators-80YZhBGC/ I want to scrape the bookmakers and the odds. The problem is I don't know what kind of table that is, nor how to scrape it. These threads might be able to help me (Scraping javascript …
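Tables like this are usually filled in by JavaScript from an XHR endpoint, so the common fix is to find that request in the browser's network tab and parse its JSON directly instead of the rendered HTML. A minimal sketch, with an entirely hypothetical payload shape (the real field names must be read off the actual response):

```python
import json

# Hypothetical example of what such a feed might look like;
# the real endpoint and keys have to be discovered in dev tools.
SAMPLE = json.dumps({
    "odds": [
        {"bookmaker": "BookA", "home": 2.10, "away": 1.75},
        {"bookmaker": "BookB", "home": 2.05, "away": 1.80},
    ]
})

def extract_odds(payload):
    # Pull (bookmaker, home, away) triples out of the JSON payload.
    data = json.loads(payload)
    return [(o["bookmaker"], o["home"], o["away"]) for o in data["odds"]]
```

In R, the same route would be `jsonlite::fromJSON()` on the discovered endpoint.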

How do I scrape an https page? [duplicate]

时光总嘲笑我的痴心妄想 Posted on 2019-12-06 09:05:43
This question already has answers here: Python Requests throwing SSLError (22 answers) Closed 5 years ago. I'm using a Python script with 'lxml' and 'requests' to scrape a web page. My goal is to grab an element from a page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access the content on the page. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page = requests.get("https://[example-page.com]", auth=('[username]','[password]')) and the error is: requests …
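With requests, the usual fixes for an SSLError are pointing `verify=` at a CA bundle (e.g. `requests.get(url, verify="/path/to/ca.pem")`) or, for debugging only, `verify=False`. The same idea expressed with the standard library and an explicit SSL context, so the two options are visible:

```python
import ssl
import urllib.request

def https_opener(cafile=None, insecure=False):
    # cafile: path to a custom CA bundle (requests' verify="...").
    # insecure=True skips certificate verification entirely
    # (requests' verify=False) -- debugging only.
    ctx = ssl.create_default_context(cafile=cafile)
    if insecure:
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
```

A call would then look like `https_opener().open("https://example.com").read()`.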

Error parsing query with XSoup

♀尐吖头ヾ Posted on 2019-12-06 07:46:24
I'm trying to parse an HTML page using Xsoup. This is my code: Document doc = Jsoup.connect("http://appsvr.mardelplata.gob.ar/Consultas07/OrdenesDeCompra/OC/index.asp?fmANIO_CON=2015&fmJURISDICCION_CON=1110200000&fmTIPOCONT_CON=--&fmNRO_OC=&Consultar=Consultar").get(); List<String> filasFiltradas = Xsoup.compile("//div[@id='listado_solicitudes'][//tr[@bgcolor='#EFF5FE' or @bgcolor='#DDEEFF'] | //div[@class='subtitle']]").evaluate(doc).list(); I tested the XPath expression with Chrome's "XPath Helper" extension and it works great, but when I run the code it throws this error: Exception in thread …
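A common workaround when a limited XPath engine rejects a union expression with nested predicates is to run the simpler sub-queries separately and combine the results. A sketch of that idea in Python (using the standard library's restricted XPath support, which has the same kind of limitation) over a toy document shaped like the one in the question:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div id="listado_solicitudes">
  <table>
    <tr bgcolor="#EFF5FE"><td>row1</td></tr>
    <tr bgcolor="#DDEEFF"><td>row2</td></tr>
    <tr bgcolor="#FFFFFF"><td>skip</td></tr>
  </table>
  <div class="subtitle">Header</div>
</div>"""

def matching_rows(xml_text):
    root = ET.fromstring(xml_text)
    # Instead of one union expression with nested predicates,
    # run each simple query on its own and concatenate.
    rows = root.findall(".//tr[@bgcolor='#EFF5FE']")
    rows += root.findall(".//tr[@bgcolor='#DDEEFF']")
    rows += root.findall(".//div[@class='subtitle']")
    return rows
```

In the Java code above, the equivalent would be compiling two or three plain Xsoup expressions and merging the lists.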

Using Tor + Privoxy to scrape google shopping results: How to avoid block?

人走茶凉 Posted on 2019-12-06 05:37:14
Question: I have installed Tor + Privoxy on my server and they're working fine (tested). But now, when I try to use urllib2 (Python) to scrape Google Shopping results, through the proxy of course, I always get blocked by Google (sometimes a 503 error, sometimes a 403 error). Does anyone have a solution to help me avoid that problem? It would be much appreciated! The source code that I am using: _HEADERS = { 'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 …
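Google rate-limits aggressively and many Tor exit IPs are already flagged, so besides slowing down requests and rotating User-Agent strings, one common tactic is asking Tor for a fresh circuit (new exit IP) via its control port before retrying. A minimal sketch, assuming the default control port 9051 and password authentication:

```python
import socket

def newnym_commands(password=""):
    # Control-port command sequence that requests a new circuit.
    return ['AUTHENTICATE "%s"' % password, "SIGNAL NEWNYM", "QUIT"]

def send_newnym(host="127.0.0.1", port=9051, password=""):
    # Push the commands to Tor's control port; after this, new
    # connections through Privoxy should use a different exit IP.
    with socket.create_connection((host, port), timeout=5) as s:
        for cmd in newnym_commands(password):
            s.sendall((cmd + "\r\n").encode())
```

Even with IP rotation, respecting delays between requests matters; hammering the endpoint from a fresh exit usually just gets the new IP blocked too.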

Using SoupStrainer to parse selectively

半世苍凉 Posted on 2019-12-06 05:28:58
Question: I'm trying to parse a list of video game titles from a shopping site. However, the item list is all stored inside a tag. This section of the documentation supposedly explains how to parse only part of the document, but I can't work it out. My code: from BeautifulSoup import BeautifulSoup import urllib import re url = "Some Shopping Site" html = urllib.urlopen(url).read() soup = BeautifulSoup(html) for a in soup.findAll('a', {'title': re.compile('.+')}): print a.string At present it prints the …
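The documentation section in question is about SoupStrainer: you build a strainer and hand it to the constructor (`parseOnlyThese=` in BeautifulSoup 3, `parse_only=` in bs4, if I recall the keyword names correctly) so only matching tags are parsed at all. The same selective idea can be shown with only the standard library, collecting just the `<a>` tags that carry a title attribute:

```python
from html.parser import HTMLParser

class TitledLinks(HTMLParser):
    # Keep only <a> tags with a non-empty title attribute,
    # mirroring SoupStrainer('a', {'title': re.compile('.+')}).
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            if d.get("title"):
                self.titles.append(d["title"])
```

Usage: instantiate, `feed()` it the HTML, and read `.titles`; everything outside the matching tags is simply ignored.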

How to log in through PHP cURL when the form is submitted by JavaScript, i.e. no submit button in the form

隐身守侯 Posted on 2019-12-05 07:51:56
Question: I am trying to log in to a secure HTTPS website through cURL. My code runs successfully for other sites, but on some websites where the form is submitted through JavaScript it does not work. Currently I am using the following code for cURL: <? # Define target page $target = "https://www.domainname.com/login.jsf"; # Define the login form data $form_data = "enter=Enter&username=webbot&password=sp1der3"; # Create the cURL session $ch = curl_init(); curl_setopt($ch, CURLOPT_URL, $target); // Define …
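A JavaScript-submitted form still ends up as an ordinary HTTP POST, so the approach is to watch the login request in the browser's dev tools, copy the action URL and every field it sends, and POST those directly while keeping cookies. Note that `.jsf` suggests JavaServer Faces, which typically requires a hidden `javax.faces.ViewState` field scraped from the login page first. A sketch of the cookie-keeping session in Python for illustration (in PHP this is CURLOPT_COOKIEJAR/CURLOPT_COOKIEFILE plus CURLOPT_POSTFIELDS):

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    # An opener that keeps cookies across requests,
    # like curl's cookie jar options.
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def encode_form(fields):
    # POST body for the fields the JavaScript submit would send;
    # the real field names must be read out of the page's <form>.
    return urllib.parse.urlencode(fields).encode()
```

Usage: `session.open(login_url, data=encode_form({...}))`, then further `session.open()` calls ride on the authenticated cookies.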

How to download images from BeautifulSoup?

谁都会走 Posted on 2019-12-05 00:35:45
Question: Image http://i.imgur.com/OigSBjF.png import requests from bs4 import BeautifulSoup r = requests.get("xxxxxxxxx") soup = BeautifulSoup(r.content) links = soup.find_all("img") for link in links: if "http" in link.get('src'): print link.get('src') I get the printed URL but don't know how to work with it. Answer 1: You need to download the content and write it to disk: import requests from os.path import basename r = requests.get("xxx") soup = BeautifulSoup(r.content) links = soup.find_all("img") for link in links: if "http" in link.get('src'): lnk = link.get('src') with …
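The truncated answer is heading toward opening a file named after the URL's basename and writing the image bytes into it. A self-contained sketch of that step with the standard library (the URL splitting is the testable part; the actual fetch is just a read of the response body):

```python
from os.path import basename
from urllib.parse import urlparse
import urllib.request

def filename_for(url):
    # Derive a local filename from the URL path,
    # e.g. ".../OigSBjF.png" -> "OigSBjF.png".
    return basename(urlparse(url).path) or "image"

def save_image(url, dest_dir="."):
    # Fetch the bytes and write them next to the derived name.
    path = "%s/%s" % (dest_dir, filename_for(url))
    with open(path, "wb") as f:
        f.write(urllib.request.urlopen(url).read())
    return path
```

With requests, the write step is the same: `open(filename_for(lnk), "wb").write(requests.get(lnk).content)`.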

Importing URLs for JSOUP to Scrape via Spreadsheet

浪子不回头ぞ Posted on 2019-12-04 22:03:11
I finally got IntelliJ to work. I'm using the code below, and it works perfectly. I need it to loop over and over, pulling links from a spreadsheet to find the price of different items. I have a spreadsheet with a few sample URLs located in column C, starting at row 2. How can I have JSOUP use the URLs in this spreadsheet and then output to column D? public class Scraper { public static void main(String[] args) throws Exception { final Document document = Jsoup.connect("examplesite.com").get(); for (Element row : document.select("#price")) { final String price = row.select("#price") …
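Jsoup itself does not read spreadsheets; in Java the spreadsheet side would come from a library like Apache POI, or more simply from exporting the sheet to CSV and reading that. A sketch of the CSV route, in Python for brevity: pull each URL out of column C (index 2, skipping the header row), and for each one the Java side would run `Jsoup.connect(url).get()` and `document.select("#price")`, writing the result back as column D.

```python
import csv

def urls_from_rows(rows):
    # Column C is index 2; data starts at row 2, so skip the header.
    return [r[2] for r in rows[1:] if len(r) > 2 and r[2]]

def read_urls(csv_path):
    # Load the exported spreadsheet and extract the URL column.
    with open(csv_path, newline="") as f:
        return urls_from_rows(list(csv.reader(f)))
```

Writing the prices out is the mirror image: append the scraped price as a fourth column and `csv.writer` the rows back.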

Jsoup cookie authentication from CookieSyncManager to scrape from an HTTPS site

℡╲_俬逩灬. Posted on 2019-12-04 17:33:27
I have an Android application using a WebView in which the user has to log in with username and password before being redirected to the page I would like to scrape data from with Jsoup. Since the Jsoup thread would be a different session, the user would have to log in again. Now I would like to use the cookie received from the WebView to send with the Jsoup request, to be able to scrape my data. The cookie is being synced with CookieSyncManager with the following code. This is basically where I am stuck, because I don't know how to read out the cookie nor how to attach it to the Jsoup request. Please help …
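On Android, `CookieManager.getInstance().getCookie(url)` returns the cookies for a URL as a single `"name1=value1; name2=value2"` header string, and Jsoup can take them as a map via `Jsoup.connect(url).cookies(map).get()`. The glue step is splitting that header string into pairs; a sketch of that parsing, in Python for illustration (the Java version is the same `split(";")` / `split("=", 2)` loop):

```python
def parse_cookie_header(header):
    # Turn "k1=v1; k2=v2" into {"k1": "v1", "k2": "v2"},
    # suitable for passing the pairs to Jsoup's .cookies(map).
    out = {}
    for part in header.split(";"):
        if "=" in part:
            k, v = part.split("=", 1)
            out[k.strip()] = v.strip()
    return out
```

The session cookie names here depend entirely on the site being scraped; whatever the WebView holds after login is what needs to be forwarded.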