screen-scraping

How to use cURL to POST form values to a form that uses JS

江枫思渺然 submitted on 2019-12-08 13:03:53
Question: *Sorry for the long post.* I'm using cURL in PHP to POST some form fields, in an effort to return the result of the post. I need some help, as the form is somewhat unusual. The cURL script:

```php
$ch = curl_init();
$data = array(
    'field_1_name' => 'field_value',
    'field_2_name' => 'field_value',
    'field_3_name' => 'field_value',
);
curl_setopt($ch, CURLOPT_URL, 'http://url.com');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$fp =
```
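For comparison, a minimal sketch of the same form POST in Python with the `requests` library; the URL and field names are the placeholders from the question, not a real endpoint:

```python
import requests

# Placeholder URL and field names copied from the question; the real form
# action and field names must come from inspecting the page's JavaScript.
data = {
    "field_1_name": "field_value",
    "field_2_name": "field_value",
    "field_3_name": "field_value",
}

# Passing a dict via data= sends an application/x-www-form-urlencoded body,
# the usual encoding for an HTML form submission.
response = requests.post("http://url.com", data=data, timeout=10)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the returned page
```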

Having Trouble Using CURL and PHP to Get Google Search Results Through a Proxy

纵然是瞬间 submitted on 2019-12-08 10:45:12
Question: This script works fine when getting google.com, but not with google.com/search?q=test. When I don't use CURLOPT_FOLLOWLOCATION, I get a 302 Moved. When I do use it, I get a page asking me to input a captcha. I've tried several different U.S.-based proxies and have varied the user agent string. Is there something I'm missing here?

```php
function my_fetch($url, $proxy, $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8')
{
    $ch = curl_init();
    curl
```
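For reference, a sketch of the same proxied fetch in Python with `requests`. Note that Google serves a captcha when it detects automated search traffic regardless of the HTTP client, so a proxy and a user agent string alone may not be enough:

```python
import requests

def my_fetch(url, proxy, user_agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; "
             "en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8"):
    """Fetch a URL through an HTTP proxy, following redirects."""
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    headers = {"User-Agent": user_agent}
    # allow_redirects=True plays the role of CURLOPT_FOLLOWLOCATION
    resp = requests.get(url, proxies=proxies, headers=headers,
                        allow_redirects=True, timeout=15)
    return resp.status_code, resp.text
```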

web scraping a problem site

你离开我真会死。 submitted on 2019-12-08 10:44:26
Question: I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far. Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view the page source. A sample URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT I'd like, for
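The symptom described (content visible in the browser but absent from the page source) usually means the detail is loaded by JavaScript after the initial HTML, which urllib and mechanize never execute. A minimal sketch, assuming Selenium and a Firefox driver are installed:

```python
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.implicitly_wait(10)  # give the page's scripts time to populate the DOM
    # A real browser runs the page's JavaScript, so page_source below is
    # the post-JavaScript DOM rather than the bare initial response.
    driver.get("https://personal.vanguard.com/us/funds/snapshot"
               "?FundId=0542&FundIntExt=INT")
    html = driver.page_source
finally:
    driver.quit()
```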

Retryable FTP & HTTP URI reading with Typhoeus?

戏子无情 submitted on 2019-12-08 09:12:33
Question: Having discussed some failure handling in "Does Ruby's 'open_uri' reliably close sockets after read or on fail?", I wanted to dig into this a little deeper. I'd like to attempt to pull data from an FTP server, then, if that fails, attempt a pull from an HTTP server. If both of these fail, I'd like to cycle around and retry several times, with a short pause between attempts (perhaps 1 second). I read about the "retryable" method in "Retrying code blocks in Ruby (on exceptions,
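Since the snippet cuts off, here is the retry-with-fallback pattern the question describes, sketched in Python's standard library rather than the Typhoeus API (urllib handles both ftp:// and http:// URLs); the same shape should translate to Typhoeus plus `retryable`:

```python
import time
import urllib.request
from urllib.error import URLError

def fetch_with_fallback(ftp_url, http_url, attempts=3, pause=1.0):
    """Try the FTP source, fall back to HTTP, and retry the whole
    cycle a few times with a short pause between attempts."""
    for _ in range(attempts):
        for url in (ftp_url, http_url):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read()
            except (URLError, OSError):
                continue  # this source failed; try the next one
        time.sleep(pause)  # short pause before the next retry cycle
    raise RuntimeError("all sources failed after %d attempts" % attempts)
```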

How to get everything between two HTML tags? (with XPath?)

耗尽温柔 submitted on 2019-12-08 07:02:36
Question: EDIT: I've added a solution which works in this case. I want to extract a table from a page, and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me. My first attempt was this (obviously faulty, because it will get the first closing table tag):

```php
<?php
$tableStart = strpos($source, '<table class="schedule"');
$tableEnd = strpos($source, '</table>', $tableStart);
$rawTable = substr($source, $tableStart, ($tableEnd - $tableStart));
?>
```

I thought this
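For reference, the tree-based version of the same extraction sketched in Python with `lxml` (the class name comes from the question's snippet); a parser matches the correct closing `</table>` even with nested tables, which the string-search approach cannot:

```python
from lxml import html

def extract_schedule_table(source):
    """Return the outer HTML of <table class="schedule">, nested tables
    and all, by querying the parsed tree instead of the raw string."""
    tree = html.fromstring(source)
    tables = tree.xpath('//table[@class="schedule"]')
    if not tables:
        return None
    return html.tostring(tables[0], encoding="unicode")
```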

Groovy htmlunit getFirstByXPath returning null + OCR Question

不想你离开。 submitted on 2019-12-08 06:54:49
Question: I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. Each of my attempts at grabbing the first row of a website has returned null. I am wondering if someone can A) explain why they might be returning null, and B) explain better ways (if there are some) to go about getting the information. Here is my current code (the URL is in the source):

```groovy
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
def url = "http://www.hidemyass.com/proxy
```
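One way to answer (A) for yourself, sketched in Python: fetch the raw, pre-JavaScript HTML and check whether the element the XPath targets is in it at all. If it is missing here but visible in Firefox, it is built client-side (or obfuscated), and a static-DOM query such as `getFirstByXPath` with `javaScriptEnabled = false` will come back null:

```python
import urllib.request

def raw_html_contains(url, marker):
    """Fetch the page as plain HTML (no JavaScript executed) and check
    for a marker string from the element the XPath should match."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        raw = resp.read().decode("utf-8", "replace")
    return marker in raw
```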

Web Scraping, data mining, data extraction

一个人想着一个人 submitted on 2019-12-08 06:52:59
Question: I am tasked with creating web scraping software, and I don't know where to even begin. Any help would be appreciated, even just telling me how this data is organized, or what "type" of data layout the website is using, because then I would be able to Google search that term. http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952
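A first diagnostic step, sketched in Python with only the standard library: fetch the base page and list where its scripts and frames load from. An ASP.NET "diagram" page like this one typically pulls its readings through secondary requests, and those URLs (not the page itself) are usually the thing worth scraping. The base URL below is trimmed from the question's full link:

```python
import urllib.request
from html.parser import HTMLParser

class SrcLister(HTMLParser):
    """Collect the src attributes of scripts, frames and images so we can
    see which secondary requests actually carry the page's data."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "iframe", "frame", "img"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append((tag, value))

url = "http://utilsub.lbs.ubc.ca/ion/default.aspx"  # base page from the question
page = urllib.request.urlopen(url, timeout=15).read().decode("utf-8", "replace")
lister = SrcLister()
lister.feed(page)
for tag, src in lister.sources:
    print(tag, src)
```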

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

孤街浪徒 submitted on 2019-12-08 05:38:11
Question: I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form (screenshot: http://dl.dropbox.com/u/2792776/screenshots/2010-04-08_1440.png). However, it turns out that this isn't even a real form. Instead, when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it. The "Log in" button HTML:

```html
<button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log
```
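One common workaround, sketched in Python: skip the JavaScript and replicate the POST it builds, carrying cookies in a session. The endpoint and field names below are placeholders, not T-Mobile's actual ones; the real values have to be read from the browser's network tab during an actual login:

```python
import requests

session = requests.Session()  # the Session object stores and resends cookies

# Placeholder endpoint and field names: watch a real login in the browser's
# network tab and copy the URL and fields the dynamic form actually submits.
session.post(
    "https://www.t-mobile.com/placeholder-login-endpoint",
    data={"username": "user", "password": "secret"},
)

# Later requests through the same session carry the login cookies.
page = session.get("https://www.t-mobile.com/")
```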

Logging into website with multiple pages using Python (urllib2 and cookielib)

谁说胖子不能爱 submitted on 2019-12-08 05:02:36
Question: I am writing a script to retrieve transaction information from my bank's home banking website, for use in a personal mobile application. The website is laid out like so: https://homebanking.purduefed.com/OnlineBanking/Login.aspx -> enter username -> submit form -> https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx -> enter password -> submit form -> https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx The problem I am having is that since there are 2 separate pages
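A sketch of the two-step login in the Python 3 spelling of urllib2/cookielib (urllib.request and http.cookiejar). The key point is one opener with one cookie jar reused for both pages, so the session cookie issued on the username page is still present when the password page is posted. The field names are placeholders, and ASP.NET pages also require the hidden __VIEWSTATE/__EVENTVALIDATION inputs scraped from each page's HTML:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def post(url, fields):
    """POST form fields through the shared opener so cookies persist."""
    data = urllib.parse.urlencode(fields).encode()
    with opener.open(url, data, timeout=30) as resp:
        return resp.read().decode("utf-8", "replace")

# Placeholder field names: the real names (and the hidden ASP.NET fields)
# must be scraped from each page's HTML before posting.
step1 = post("https://homebanking.purduefed.com/OnlineBanking/Login.aspx",
             {"username": "me"})
step2 = post("https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx",
             {"password": "secret"})
```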

Scraping multiple tables out of a webpage in R

╄→гoц情女王★ submitted on 2019-12-08 04:43:00
Question: I am trying to pull mutual fund data into R. My code works for a single table, but when there are multiple tables on a webpage, it doesn't work. Link: https://in.finance.yahoo.com/q/pm?s=115748.BO My code:

```r
url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)
```

But I am getting an error message:

```
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function
```
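Two hedged notes. First, this "unable to find an inherited method" error from readHTMLTable often means the function was handed something it could not parse; with an https URL, a common cause is that the XML package cannot fetch https directly, so downloading the page first (e.g. with RCurl or httr) and then parsing is the usual fix. Second, for comparison, the same idea sketched in Python, where the table list can be inspected before choosing an index:

```python
import pandas as pd

# read_html returns one DataFrame per <table> found on the page, so you
# can count them and inspect each one before committing to an index.
url = "https://in.finance.yahoo.com/q/pm?s=115748.BO"
tables = pd.read_html(url)   # needs lxml (or html5lib) installed
print(len(tables))           # how many tables were parsed
perftable = tables[0]        # pick the right index after inspecting
```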