screen-scraping

How to use cURL to POST form values to a form that uses JS

江枫思渺然 submitted on 2019-12-08 13:03:53
Question: *Sorry for the long post.* I'm using cURL in PHP to POST some form fields, in an effort to return the result of the post. I need some help, as the form is somewhat unusual. The cURL script:

```php
$ch = curl_init();
$data = array(
    'field_1_name' => 'field_value',
    'field_2_name' => 'field_value',
    'field_3_name' => 'field_value',
);
curl_setopt($ch, CURLOPT_URL, 'http://url.com');
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$fp =
```
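For comparison, a minimal sketch of the same form POST in Python with the `requests` library; the URL and field names are the placeholders from the question, not a real endpoint:

```python
import requests

# Placeholder URL and field names copied from the question; the real form
# action and field names must come from inspecting the page's JavaScript.
data = {
    "field_1_name": "field_value",
    "field_2_name": "field_value",
    "field_3_name": "field_value",
}

# Passing a dict via data= sends an application/x-www-form-urlencoded body,
# the usual encoding for an HTML form submission.
response = requests.post("http://url.com", data=data, timeout=10)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the returned page
```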

Having Trouble Using CURL and PHP to Get Google Search Results Through a Proxy

纵然是瞬间 submitted on 2019-12-08 10:45:12
Question: This script works fine when getting google.com, but not with google.com/search?q=test. When I don't use CURLOPT_FOLLOWLOCATION, I get a 302 Moved. When I do use it, I get a page asking me to input a captcha. I've tried several different U.S.-based proxies and have varied the user agent string. Is there something I'm missing here?

```php
function my_fetch($url, $proxy, $user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8')
{
    $ch = curl_init();
    curl
```
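For reference, a sketch of the same proxied fetch in Python with `requests`. Note that Google serves a captcha when it detects automated search traffic regardless of the HTTP client, so a proxy and a user agent string alone may not be enough:

```python
import requests

def my_fetch(url, proxy, user_agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; "
             "en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8"):
    """Fetch a URL through an HTTP proxy, following redirects."""
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    headers = {"User-Agent": user_agent}
    # allow_redirects=True plays the role of CURLOPT_FOLLOWLOCATION
    resp = requests.get(url, proxies=proxies, headers=headers,
                        allow_redirects=True, timeout=15)
    return resp.status_code, resp.text
```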

web scraping a problem site

你离开我真会死。 submitted on 2019-12-08 10:44:26
Question: I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far. Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view the page source. A sample URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT I'd like, for
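The symptom described (content visible in the browser but absent from the page source) usually means the detail is loaded by JavaScript after the initial HTML, which urllib and mechanize never execute. A minimal sketch, assuming Selenium and a Firefox driver are installed:

```python
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.implicitly_wait(10)  # give the page's scripts time to populate the DOM
    # A real browser runs the page's JavaScript, so page_source below is
    # the post-JavaScript DOM rather than the bare initial response.
    driver.get("https://personal.vanguard.com/us/funds/snapshot"
               "?FundId=0542&FundIntExt=INT")
    html = driver.page_source
finally:
    driver.quit()
```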

Retryable FTP & HTTP URI reading with Typhoeus?

戏子无情 submitted on 2019-12-08 09:12:33
Question: Having discussed some failure handling in "Does Ruby's 'open_uri' reliably close sockets after read or on fail?", I wanted to dig into this a little deeper. I'd like to attempt to pull data from an FTP server, then, if that fails, attempt a pull from an HTTP server. If both of these fail, I'd like to cycle around and retry several times, with a short pause between attempts (perhaps 1 second). I read about the "retryable" method in "Retrying code blocks in Ruby (on exceptions,
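Since the snippet cuts off, here is the retry-with-fallback pattern the question describes, sketched in Python's standard library rather than the Typhoeus API (urllib handles both ftp:// and http:// URLs); the same shape should translate to Typhoeus plus `retryable`:

```python
import time
import urllib.request
from urllib.error import URLError

def fetch_with_fallback(ftp_url, http_url, attempts=3, pause=1.0):
    """Try the FTP source, fall back to HTTP, and retry the whole
    cycle a few times with a short pause between attempts."""
    for _ in range(attempts):
        for url in (ftp_url, http_url):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read()
            except (URLError, OSError):
                continue  # this source failed; try the next one
        time.sleep(pause)  # short pause before the next retry cycle
    raise RuntimeError("all sources failed after %d attempts" % attempts)
```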

How to get everything between two HTML tags? (with XPath?)

耗尽温柔 submitted on 2019-12-08 07:02:36
Question: EDIT: I've added a solution which works in this case. I want to extract a table from a page, and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me. My first attempt was this (obviously faulty, because it will get the first closing table tag):

```php
<?php
$tableStart = strpos($source, '<table class="schedule"');
$tableEnd = strpos($source, '</table>', $tableStart);
$rawTable = substr($source, $tableStart, ($tableEnd - $tableStart));
?>
```

I thought this
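For reference, the tree-based version of the same extraction sketched in Python with `lxml` (the class name comes from the question's snippet); a parser matches the correct closing `</table>` even with nested tables, which the string-search approach cannot:

```python
from lxml import html

def extract_schedule_table(source):
    """Return the outer HTML of <table class="schedule">, nested tables
    and all, by querying the parsed tree instead of the raw string."""
    tree = html.fromstring(source)
    tables = tree.xpath('//table[@class="schedule"]')
    if not tables:
        return None
    return html.tostring(tables[0], encoding="unicode")
```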

Groovy htmlunit getFirstByXPath returning null + OCR Question

不想你离开。 submitted on 2019-12-08 06:54:49
Question: I have had a few issues with HtmlUnit returning nulls lately and am looking for guidance. Each of my attempts at grabbing the first row of a website has returned null. I am wondering if someone can A) explain why they might be returning null, and B) explain better ways (if there are some) to go about getting the information. Here is my current code (the URL is in the source):

```groovy
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
def url = "http://www.hidemyass.com/proxy
```
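One way to answer (A) for yourself, sketched in Python: fetch the raw, pre-JavaScript HTML and check whether the element the XPath targets is in it at all. If it is missing here but visible in Firefox, it is built client-side (or obfuscated), and a static-DOM query such as `getFirstByXPath` with `javaScriptEnabled = false` will come back null:

```python
import urllib.request

def raw_html_contains(url, marker):
    """Fetch the page as plain HTML (no JavaScript executed) and check
    for a marker string from the element the XPath should match."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=15) as resp:
        raw = resp.read().decode("utf-8", "replace")
    return marker in raw
```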

Web Scraping, data mining, data extraction

一个人想着一个人 submitted on 2019-12-08 06:52:59
Question: I am tasked with creating web scraping software, and I don't know where to even begin. Any help would be appreciated, even just telling me how this data is organized, or what "type" of data layout the website is using, because then I would be able to Google search that term. http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952
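A first diagnostic step, sketched in Python with only the standard library: fetch the base page and list where its scripts and frames load from. An ASP.NET "diagram" page like this one typically pulls its readings through secondary requests, and those URLs (not the page itself) are usually the thing worth scraping. The base URL below is trimmed from the question's full link:

```python
import urllib.request
from html.parser import HTMLParser

class SrcLister(HTMLParser):
    """Collect the src attributes of scripts, frames and images so we can
    see which secondary requests actually carry the page's data."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "iframe", "frame", "img"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append((tag, value))

url = "http://utilsub.lbs.ubc.ca/ion/default.aspx"  # base page from the question
page = urllib.request.urlopen(url, timeout=15).read().decode("utf-8", "replace")
lister = SrcLister()
lister.feed(page)
for tag, src in lister.sources:
    print(tag, src)
```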

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

孤街浪徒 submitted on 2019-12-08 05:38:11
Question: I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form (screenshot: http://dl.dropbox.com/u/2792776/screenshots/2010-04-08_1440.png). However, it turns out that this isn't even a real form. Instead, when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it. The "Log in" button HTML:

```html
<button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log
```
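One common workaround, sketched in Python: skip the JavaScript and replicate the POST it builds, carrying cookies in a session. The endpoint and field names below are placeholders, not T-Mobile's actual ones; the real values have to be read from the browser's network tab during an actual login:

```python
import requests

session = requests.Session()  # the Session object stores and resends cookies

# Placeholder endpoint and field names: watch a real login in the browser's
# network tab and copy the URL and fields the dynamic form actually submits.
session.post(
    "https://www.t-mobile.com/placeholder-login-endpoint",
    data={"username": "user", "password": "secret"},
)

# Later requests through the same session carry the login cookies.
page = session.get("https://www.t-mobile.com/")
```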

Logging into website with multiple pages using Python (urllib2 and cookielib)

谁说胖子不能爱 submitted on 2019-12-08 05:02:36
Question: I am writing a script to retrieve transaction information from my bank's home banking website, for use in a personal mobile application. The website is laid out like so: https://homebanking.purduefed.com/OnlineBanking/Login.aspx -> enter username -> submit form -> https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx -> enter password -> submit form -> https://homebanking.purduefed.com/OnlineBanking/AccountSummary.aspx The problem I am having is that since there are 2 separate pages
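A sketch of the two-step login in the Python 3 spelling of urllib2/cookielib (urllib.request and http.cookiejar). The key point is one opener with one cookie jar reused for both pages, so the session cookie issued on the username page is still present when the password page is posted. The field names are placeholders, and ASP.NET pages also require the hidden __VIEWSTATE/__EVENTVALIDATION inputs scraped from each page's HTML:

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def post(url, fields):
    """POST form fields through the shared opener so cookies persist."""
    data = urllib.parse.urlencode(fields).encode()
    with opener.open(url, data, timeout=30) as resp:
        return resp.read().decode("utf-8", "replace")

# Placeholder field names: the real names (and the hidden ASP.NET fields)
# must be scraped from each page's HTML before posting.
step1 = post("https://homebanking.purduefed.com/OnlineBanking/Login.aspx",
             {"username": "me"})
step2 = post("https://homebanking.purduefed.com/OnlineBanking/AOP/Password.aspx",
             {"password": "secret"})
```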

Scraping multiple tables out of a webpage in R

╄→гoц情女王★ submitted on 2019-12-08 04:43:00
Question: I am trying to pull mutual fund data into R. My code works for a single table, but when there are multiple tables on a webpage, it doesn't work. Link: https://in.finance.yahoo.com/q/pm?s=115748.BO My code:

```r
url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)
```

But I am getting an error message:

```
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function
```
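Two hedged notes. First, this "unable to find an inherited method" error from readHTMLTable often means the function was handed something it could not parse; with an https URL, a common cause is that the XML package cannot fetch https directly, so downloading the page first (e.g. with RCurl or httr) and then parsing is the usual fix. Second, for comparison, the same idea sketched in Python, where the table list can be inspected before choosing an index:

```python
import pandas as pd

# read_html returns one DataFrame per <table> found on the page, so you
# can count them and inspect each one before committing to an index.
url = "https://in.finance.yahoo.com/q/pm?s=115748.BO"
tables = pd.read_html(url)   # needs lxml (or html5lib) installed
print(len(tables))           # how many tables were parsed
perftable = tables[0]        # pick the right index after inspecting
```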