screen-scraping | 易学教程

.Net Screen scraping and session

I am trying to screen scrape using C#. It works a few times, after which I receive a "Session expired" error. Any help will be appreciated. Answer (Brett Allen): Here is the set of classes I am using for screen scraping. (I wrote these classes; feel free to use them however you want.) There may be some bugs in them, but they have worked flawlessly for every use I have had. They also handle SSL websites fine, work with redirects, and capture the original pages that caused a redirect in the WebPage class. using System; using System.Collections.Generic; using System.IO; using System.Net; using System.Text;

Scrape a dynamic website

Question: What is the best method to scrape a dynamic website where most of the content is generated by what appear to be Ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new. Edit, for more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API. Answer 1: This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site

page scraping to get prices from google finance

I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib package and then a regex to extract the price data. When I leave my Python script running, it works for a few minutes and then starts throwing the exception [HTTP Error 503: Service Unavailable]. I guess this happens because the web server detects the frequent page requests as coming from a robot and starts returning this error after a while. Is there a way around this, e.g. deleting or creating some cookie? Or, even better, if Google offers an API, I want to do this in
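One common way to cope with intermittent 503 responses like the one described above is to slow down and retry with exponential backoff. The following is only a minimal sketch under stated assumptions: `fetch_with_backoff`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names introduced for illustration, and nothing here is specific to Google Finance or guaranteed to avoid rate limiting.

```python
import time
import urllib.error


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on HTTP 503.

    `fetch` is any callable that takes a URL and returns the page body
    (hypothetical; e.g. a thin wrapper around urllib.request.urlopen).
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Retry only on 503; re-raise other errors, and give up
            # after the final attempt.
            if err.code != 503 or attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

In a real script, `fetch` could be a small function that opens the URL with `urllib.request.urlopen` and returns `response.read()`. Passing `sleep` as a parameter keeps the retry logic easy to exercise without actually waiting.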

Reading and posting to web pages using C#

I have a project at work that requires me to enter information into a web page, read the next page I get redirected to, and then take further action. A simplified real-world example would be going to google.com, entering "Coding tricks" as the search criteria, and reading the resulting page. Small coding examples like the ones linked at http://www.csharp-station.com/HowTo/HttpWebFetch.aspx show how to read a web page, but not how to interact with it by submitting information into a form and continuing on to the next page. For the record, I'm not building a malicious and
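The question asks about C#, but the general flow (encode the form fields, POST them, keep cookies, follow the redirect) can be sketched in Python with the standard library. This is an illustrative sketch only: the helper names `build_opener_with_cookies` and `build_form_request` are hypothetical, and the real field names and action URL have to be read from the target page's form.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_opener_with_cookies():
    """Return an opener that keeps cookies across requests.

    Openers built this way also follow HTTP redirects by default.
    """
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_form_request(action_url, fields):
    """Encode form fields as an application/x-www-form-urlencoded POST body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(action_url, data=body, method="POST")
```

Calling `opener.open(request)` on the result submits the form; the response's `geturl()` then reports the final URL after any redirects, and `read()` returns the page that followed the submission.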

HTML Parsing - Get data from a table inside a div?

I am relatively new to the whole idea of HTML parsing/scraping, and I was hoping I could get the help I need here. Basically, what I am looking to do (I think) is specify the URL of the page I wish to grab the data from (in this case, http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/). From there, I want to grab the table with class=listing inside the div with id=snapshot_table. I then wish to embed that table in my own page and have it update when the original content is updated. I have read a few of the other posts on Google and Stack Overflow, and I also had a look at a tutorial on Nettuts+
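The selection step described above (a table with class=listing inside a div with id=snapshot_table) can be sketched with Python's standard `html.parser` module. This is a minimal sketch, not a robust scraper: the class and function names are hypothetical, nested tables are not handled, and it assumes the table is present in the HTML the server returns rather than filled in later by JavaScript.

```python
from html.parser import HTMLParser


class TableInDivParser(HTMLParser):
    """Collect cell text from <table class="listing"> inside <div id="snapshot_table">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside the target div (counts nested divs)
        self.in_table = False  # True while inside the target table
        self.in_cell = False   # True while inside a <td> or <th>
        self.rows = []         # list of rows, each a list of cell strings

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth or attrs.get("id") == "snapshot_table":
                self.div_depth += 1
        elif self.div_depth and tag == "table" and "listing" in (attrs.get("class") or "").split():
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def extract_listing(html):
    """Return the target table as a list of rows of stripped cell text."""
    parser = TableInDivParser()
    parser.feed(html)
    return [[cell.strip() for cell in row] for row in parser.rows]
```

Re-running the extraction on a schedule and regenerating the embedded table is one way to keep a copy in sync with the original page; a dedicated HTML parsing library would cope better with malformed markup.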

How to get content of a javascript/ajax-loaded div on a site?

I have a PHP script that loads page content from another website using cURL and the simple_html_dom PHP library. This works great: if I echo the returned HTML, I can see the div content there. However, if I try to select only that div with simple_html_dom, the div is always returned empty. At first I didn't know why. Now I know it's because its content is apparently populated by JavaScript/Ajax. How would I get the content of the site and then select the div content AFTER the JavaScript has populated it with the correct content? Is it even possible? Thanks! Answer: Yes its piece

Python WWW macro

I need something like iMacros for Python. It would be great to have something like this: browse_to('www.google.com') type_in_input('search', 'query') click_button('search') list = get_all('<p>') Do you know of something like that? Thanks in advance, Etam. Answer: Almost a direct fulfillment of the wishes in the question: twill. twill is a simple language that allows users to browse the Web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and most standard Web features. twill supports automated Web testing and has a simple Python interface. (

Scraping data from all asp.net pages with AJAX pagination implemented

Question: I want to scrape a web page containing a list of users with addresses, emails, etc. The list is paginated: each page shows 10 users, and clicking the page 2 link loads the second page of users via AJAX and updates the list, and so on for all pagination links. The website is built with ASP.NET (the pages have the .aspx extension). Since I don't know anything about ASP.NET or how it manages pagination and AJAX, I am using Simple HTML DOM http://sourceforge.net/projects/simplehtmldom/ to

Extract video from .swf using Python

Question: I've written code that generated the links to videos such as the one below. Once obtained, I try to download it in this manner: import urllib.request import os url = 'http://www.videodetective.net/flash/players/?customerid=300120&playerid=351&publishedid=319113&playlistid=0&videokbrate=750&sub=RTO&pversion=5.2%22%20width=%22670%22%20height=%22360%22' response = urllib.request.urlopen(url).read() outpath = os.path.join(os.getcwd(), 'video.mp4') videofile = open(outpath , 'wb') videofile.write

Is there a library similar to lxml or nokogiri for Java? [closed]

Question: I want to do some screen scraping, ideally using CSS selectors and not XPath. Is there a library similar to the ones in Ruby or Python? Answer 1: There are dozens of screen-scraping libraries written in Java. Just to cite a few: TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid