screen-scraping

How to post an ASP.NET login form using PHP/cURL?

試著忘記壹切 submitted on 2019-12-06 10:53:54
Question: I need to create a tool that will post an ASP.NET login form using PHP so that I can gather details from the user's summary page, which is displayed after they are logged in. Because the site uses ASP.NET and the form has __VIEWSTATE and __EVENTVALIDATION hidden fields, as I understand it I must fetch those values first, then submit them in the POST to the login form for this to work. I am new to PHP. The script I have created should do the following: 1) GET the login form and grab _
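The flow described above — fetch the form, lift the ASP.NET hidden fields, then include them in the login POST — can be sketched as follows. The question uses PHP/cURL, but the technique is language-independent, so this is a minimal Python sketch; the sample HTML, the credentials, and the field names other than the standard __VIEWSTATE/__EVENTVALIDATION are made up:

```python
import re

# Illustrative sample of the kind of HTML an ASP.NET login page returns.
html = """
<form action="login.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA3O...==" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAgL...==" />
  <input type="text" name="username" />
  <input type="password" name="password" />
</form>
"""

def hidden_field(name, page):
    """Pull the value of a hidden input out of the fetched form page."""
    m = re.search(r'name="%s"\s+value="([^"]*)"' % re.escape(name), page)
    return m.group(1) if m else None

payload = {
    "__VIEWSTATE": hidden_field("__VIEWSTATE", html),
    "__EVENTVALIDATION": hidden_field("__EVENTVALIDATION", html),
    "username": "user",    # hypothetical input names --
    "password": "secret",  # check the real form for the actual ones
}
```

The payload would then be POSTed to the form's action URL, reusing the cookies from the initial GET so the server ties both requests to one session.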

Scraping basketball-reference.com in R (XML package not fully working)

坚强是说给别人听的谎言 submitted on 2019-12-06 10:53:02
I have been scraping various pages of basketball-reference for a while now in R with the XML package, using readHTMLTable, without any issues, but now I have one. When I try to scrape the splits section of a player's page, it only returns the first row of the table, not all of them. For example:

URL = "http://www.basketball-reference.com/players/j/jamesle01/splits/"
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]

This gives me only one row in the table, the first one. I want all the rows, however. I think the problem is that there are multiple headers in the table, but I'm not sure how to fix that.
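The questioner's guess is plausible: basketball-reference repeats the header row inside the table body before each split group, which can confuse naive table readers. One language-agnostic fix is to drop any extracted row that duplicates the header — shown here as a Python sketch with made-up data, since the idea carries over to the R result just as well:

```python
# Rows as scraped: the "splits" table repeats its header before each section.
header = ["Split", "G", "PTS"]
rows = [
    ["Split", "G", "PTS"],   # repeated header
    ["Home", "41", "27.1"],
    ["Split", "G", "PTS"],   # repeated header again
    ["Road", "41", "26.4"],
]

# Keep only genuine data rows.
data = [r for r in rows if r != header]
```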

How can I get IE credentials to use in my code?

萝らか妹 submitted on 2019-12-06 10:41:34
Question: I'm currently developing an IE plugin using SpicIE. This plugin does some web scraping, similar to the example posted on MSDN:

WebRequest request = WebRequest.Create("http://www.contoso.com/default.html");
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
StreamReader reader = new StreamReader(dataStream);
string responseFromServer = reader.ReadToEnd();
reader.Close

Iconv::IllegalSequence when using WWW::Mechanize

北城以北 submitted on 2019-12-06 09:32:21
Question: I'm trying to do a little bit of web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes. The POST request results in a 302 redirect (which Mechanize follows, so far so good), and the resulting page seems to crash it. I googled quite a bit, but nothing has come up so far on how to solve this. Any of you got an idea? Code:

require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de
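Iconv::IllegalSequence generally means the redirect target contains bytes that are not valid in the encoding the library expects. One language-independent workaround is to take the raw response bytes and decode them with a chain of fallbacks, replacing bad sequences as a last resort — a minimal Python sketch of that idea (the encoding list is an assumption; adjust it to the site):

```python
def decode_lenient(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each candidate encoding; fall back to replacing bad bytes."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

# Latin-1 bytes that are invalid as UTF-8 -- the kind of input
# that makes a strict converter raise.
text = decode_lenient(b"Stra\xdfe")
```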

How do I scrape data from a page that loads specific data after the main page load?

不羁的心 submitted on 2019-12-06 09:09:33
Question: I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the Hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358 My script looks like this right now:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL
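Data that appears only after the main page load is typically either fetched by JavaScript via an XHR (which you can call directly once you spot it in the browser's network tab) or shipped inside the page as embedded JSON that client script renders. Nokogiri only ever sees the initial HTML, so the trick is to find that second source. A hedged Python sketch of the embedded-JSON case — the page snippet and the orderData variable name are invented for illustration:

```python
import json
import re

# Hypothetical page: the tracking status is absent from the rendered DOM
# but ships inside a <script> tag as JSON for client-side code to render.
page = """
<html><body><div id="track"></div>
<script>var orderData = {"orderNumber": "1316358", "status": "Shipped"};</script>
</body></html>
"""

m = re.search(r"var orderData = (\{.*?\});", page)
order = json.loads(m.group(1))
```

If the data is not embedded at all, the fallback is to replay the XHR the page makes, or to drive a real browser engine.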

Scraping data from a secure website or automating mundane task

守給你的承諾、 submitted on 2019-12-06 08:50:54
Question: I have a website where I need to log in with a username, password, and captcha. Once in, I have a control panel that has bookings. For each booking there is a link to a details page that has the email address of the person who made the booking. Each day I need a list of all these email addresses so I can send them an email. I know how to scrape sites in .NET to get these types of details, but not for websites where I need to be logged in. I saw an article where I can pass the cookie as a header and
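The cookie-as-header approach mentioned above works like this: log in once in a normal browser (solving the captcha by hand), copy the session cookie, and attach it to every scraping request so the server treats them as the same logged-in session. A Python illustration of the idea — the cookie name/value and URL are invented, and a .NET client would set the same request header:

```python
import urllib.request

# Session cookie copied from the browser after a manual login
# (name and value are made up for illustration).
session_cookie = "ASP.NET_SessionId=abc123def456"

req = urllib.request.Request(
    "https://example.com/bookings/details?id=1",
    headers={"Cookie": session_cookie},
)
# urllib.request.urlopen(req) would now be served as the logged-in user,
# at least until the session expires server-side.
```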

Need to scrape information from a webpage with a “show more” button, any recommendations?

为君一笑 submitted on 2019-12-06 08:35:12
Question: I am currently developing a "crawler" for educational reasons. Everything is working fine: I can extract URLs and information and save them in a JSON file, everything is all fine and dandy... EXCEPT the page has a "load more" button that I NEED to interact with in order for the crawler to continue looking for more URLs. This is where I could use you amazing guys & girls! Any recommendations on how to do this? I would like to interact with the "load more" button and re-send the HTML information to my
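A "load more" button almost always fires a paginated request under the hood (watch the browser's network tab for the URL and its page/offset parameter); replicating that request in a loop is usually simpler than driving the button itself. A sketch of the loop, with a stub standing in for the site's real endpoint, which is unknown here:

```python
def collect_all(fetch_page):
    """Request successive pages until one comes back empty --
    the same thing the 'load more' button does behind the scenes."""
    items, page = [], 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

# Stub standing in for the site's real paging endpoint.
def fake_fetch(page):
    data = {1: ["url1", "url2"], 2: ["url3"]}
    return data.get(page, [])

urls = collect_all(fake_fetch)
```

If the extra content really only exists after a click (rendered client-side with no inspectable request), browser automation such as Selenium is the usual fallback.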

Python Mechanize Browser: HTTP Error 460

跟風遠走 submitted on 2019-12-06 07:58:06
I am trying to log into a site using a mechanize browser and am getting an HTTP 460 error, which appears to be a made-up status code, so I'm not sure what to make of it. Here's the code:

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9
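460 is indeed not a status code defined by the HTTP spec — some servers and load balancers use private 4xx codes — so the useful move is to catch the error and inspect its code and body rather than let it propagate. A sketch using urllib's HTTPError (mechanize raises a compatible error type), with the failure simulated locally since the real server isn't reproducible here:

```python
import urllib.error

def open_with_diagnostics(opener):
    """Run a request; on HTTP failure, report and return the status code
    so the caller can decide what to do with non-standard codes like 460."""
    try:
        return opener()
    except urllib.error.HTTPError as e:
        print("server answered %d (%s)" % (e.code, e.reason))
        return e.code

# Simulate the failure locally: a made-up 460 status.
def fake_open():
    raise urllib.error.HTTPError("http://example.com/login", 460, "Unknown", {}, None)

status = open_with_diagnostics(fake_open)
```

In practice, also read e.headers and the response body on failure — non-standard codes often come with an explanatory message from whatever middlebox generated them.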

Getting source of a page after it's rendered in a templating engine?

独自空忆成欢 submitted on 2019-12-06 07:19:52
So I'm doing some screen scraping on a site that is very JS-heavy. It uses a client-side templating engine that renders all the content. I tried using jQuery, and that worked in the console, but not on the server (Node.js), obviously. I looked at a few libraries for Python and Java, and they seem to be able to handle what I want, but I would prefer a JS solution that works with a Node server. Is there any way to get the complete source of a page after it's rendered, using Node? I used jsdom for screen scraping and the code goes here...

var jsdom = require('jsdom');
jsdom.env({
url: <give_url

Should I use Yahoo Pipes to scrape the contents of a div?

≡放荡痞女 submitted on 2019-12-06 06:22:10
Given: URL http://www.contoso.com/search.php?q={param} returns:

<html>
  <body>
    {...}
    <div id='foo'>
      <div id='page1'/>
      <div id='page2'/>
      <div id='page3'/>
      <div id='pageN'/>
    </div>
    {...}
  </body>
</html>

Wanted: The innerHTML of div id='foo' must be fetched by the client (i.e. JavaScript). It will be split into discrete items (i.e. div id='page1' to div id='pageN'). API throttling prevents server-side code from pre-fetching the data, so the parsing and manipulation burden must be placed on the client. Question: Could Yahoo Pipes help format the data for easier consumption? The