screen-scraping

.NET screen scraping and session

走远了吗. submitted on 2019-12-03 10:15:46
I am trying to screen scrape using C#. It works a few times, after which I receive a "Session expired" error. Any help will be appreciated. Brett Allen: Here is the set of classes I am using for screen scraping. (I wrote these classes; feel free to use them however you want.) There may be some bugs in it, but in every usage I have for it, it works quite flawlessly. It also handles SSL websites fine, works with redirects, and captures the original pages that caused a redirect as well in the WebPage class. using System; using System.Collections.Generic; using System.IO; using System.Net; using System.Text;

Scrape a dynamic website

三世轮回 submitted on 2019-12-03 09:14:49
Question: What is the best method to scrape a dynamic website where most of the content is generated by what appear to be AJAX requests? I have previous experience with a Mechanize, BeautifulSoup, and Python combo, but I am up for something new. --Edit-- For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an API. Answer 1: This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site

page scraping to get prices from google finance

瘦欲@ submitted on 2019-12-03 09:12:11
I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib package and then a regex to extract the price data. When I leave my Python script running, it works initially for a few minutes and then starts throwing an exception [HTTP Error 503: Service Unavailable]. I guess this is happening because the web server detects the frequent page requests as coming from a robot and starts returning this error after a while. Is there a way around this, e.g. deleting or creating some cookie? Or, even better, if Google provides some API, I want to do this in
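One common way to cope with intermittent 503 responses is to retry with a growing delay. The sketch below assumes the 503 is a rate-limiting signal, which the question only guesses at; it is not a confirmed fix for Google Finance. The function name `fetch_with_backoff` and its parameters are hypothetical, chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, fetch=None, retries=4, base_delay=1.0, sleep=time.sleep):
    """Fetch url, waiting longer after each HTTP 503 before retrying.

    fetch and sleep are injectable (hypothetical parameters) so the retry
    logic can be exercised without touching the network.
    """
    if fetch is None:
        # Default fetcher: a plain urllib GET returning the body as bytes.
        fetch = lambda u: urllib.request.urlopen(u).read()
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Retry only on 503; re-raise other errors or when out of retries.
            if err.code != 503 or attempt == retries:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Polling less often in the first place is usually the more reliable measure; the backoff only softens the failure once the server has started refusing requests.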

Reading and posting to web pages using C#

血红的双手。 submitted on 2019-12-03 09:06:57
I have a project at work that requires me to be able to enter information into a web page, read the next page I get redirected to, and then take further action. A simplified real-world example would be something like going to google.com, entering "Coding tricks" as search criteria, and reading the resulting page. Small coding examples like the ones linked to at http://www.csharp-station.com/HowTo/HttpWebFetch.aspx show how to read a web page, but not how to interact with it by submitting information into a form and continuing on to the next page. For the record, I'm not building a malicious and

HTML Parsing - Get data from a table inside a div?

老子叫甜甜 submitted on 2019-12-03 08:45:28
I am relatively new to the whole idea of HTML parsing/scraping. I was hoping that I could come here to get the help that I need! Basically what I am looking to do (I think) is specify the URL of the page I wish to grab the data from. In this case - http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/ From there, I want to grab the table class=listing in the div id=snapshot_table. I then wish to embed that table onto my own page and have it update when the original content is updated. I have read a few of the other posts on Google and Stack Overflow, I also had a look at a tutorial on Nettuts+
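The selection described above (a table with class `listing` inside a div with id `snapshot_table`) can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not the approach from the original thread: the class name `SnapshotTableParser`, the helper `extract_listing`, and the sample markup with its cell values are all invented here, and the real page's structure may differ.

```python
from html.parser import HTMLParser


class SnapshotTableParser(HTMLParser):
    """Collect cell text from <table class="listing"> inside <div id="snapshot_table">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0     # > 0 while inside the target div (counts nested divs)
        self.in_table = False  # True while inside the matching table
        self.in_cell = False   # True while inside a <td> or <th>
        self.rows = []
        self._row = []
        self._cell = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth or attrs.get("id") == "snapshot_table":
                self.div_depth += 1
        elif self.div_depth and tag == "table":
            if "listing" in (attrs.get("class") or "").split():
                self.in_table = True
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "table":
            self.in_table = False
        elif self.in_table and tag == "tr":
            self.rows.append(self._row)
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self.in_cell:
            self._cell.append(data)


def extract_listing(html):
    """Return the target table as a list of rows, each a list of cell strings."""
    parser = SnapshotTableParser()
    parser.feed(html)
    return parser.rows


# Invented sample markup; the first table sits in a different div and is skipped.
SAMPLE = """
<div id="other"><table class="listing"><tr><td>ignore</td></tr></table></div>
<div id="snapshot_table">
  <table class="listing">
    <tr><th>Name</th><th>EP</th></tr>
    <tr><td>Alice</td><td>120</td></tr>
  </table>
</div>
"""

if __name__ == "__main__":
    print(extract_listing(SAMPLE))
```

Keeping the embedded copy up to date would mean re-fetching the page and re-running the extraction on a schedule; the parser itself only handles the selection step.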

How to get content of a JavaScript/AJAX-loaded div on a site?

自闭症网瘾萝莉.ら submitted on 2019-12-03 08:09:08
I have a PHP script that loads page content from another website by using cURL and the simple_html_dom PHP library. This works great. If I echo out the HTML returned, I can see the div content there. However, if I try to select only that div with simple_html_dom, the div is always returned empty. At first I didn't know why. Now I know that it's because its content is apparently populated with JavaScript/AJAX. How would I get the content of the site and then be able to select the div content AFTER the JavaScript has populated it with the correct content? Is it even possible? Thanks! Yes its piece

Python WWW macro

天涯浪子 submitted on 2019-12-03 07:23:21
I need something like iMacros for Python. It would be great to have something like this: browse_to('www.google.com') type_in_input('search', 'query') click_button('search') list = get_all('<p>') Do you know of something like that? Thanks in advance, Etam. Almost a direct fulfillment of the wishes in the question - twill. twill is a simple language that allows users to browse the Web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and most standard Web features. twill supports automated Web testing and has a simple Python interface. (

Scraping data from all asp.net pages with AJAX pagination implemented

橙三吉。 submitted on 2019-12-03 06:49:53
Question: I want to scrape a webpage containing a list of users with addresses, emails, etc. The webpage shows the list of users with pagination, i.e. a page contains 10 users; when I click on the page 2 link, it loads the users list from the 2nd page via AJAX and updates the list, and so on for all pagination links. The website is developed in ASP, i.e. pages with the extension .aspx. Since I don't know anything about ASP.NET and how it manages pagination and AJAX, I am using simple html dom http://sourceforge.net/projects/simplehtmldom/ to
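ASP.NET Web Forms pages typically implement pagination links as postbacks: the browser re-submits the page's hidden fields (such as `__VIEWSTATE`) together with `__EVENTTARGET` and `__EVENTARGUMENT` identifying the clicked control. The sketch below, using only the Python standard library, shows how such a POST body could be assembled. It assumes the target site uses this mechanism, which the question does not confirm; `build_postback`, the control name `ctl00$Pager`, and the argument `Page$2` are hypothetical, and a page that paginates through an AJAX panel may require additional fields.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


def build_postback(html, event_target, event_argument=""):
    """Return a URL-encoded POST body that mimics clicking a postback link."""
    parser = HiddenFieldParser()
    parser.feed(html)
    fields = dict(parser.fields)
    # These two fields tell the server which control triggered the postback.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")


# Invented sample form; real pages carry much longer hidden values.
SAMPLE_FORM = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="visible" />
</form>
"""

if __name__ == "__main__":
    # Hypothetical control name and argument for "go to page 2".
    print(build_postback(SAMPLE_FORM, "ctl00$Pager", "Page$2"))
```

The real control name and argument can usually be read from the `href` of the pagination link, which often has the form `javascript:__doPostBack('target','argument')`. The hidden fields change with every response, so they need to be re-extracted from each page before requesting the next one.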

Extract video from .swf using Python

≯℡__Kan透↙ submitted on 2019-12-03 06:25:59
Question: I've written code that generates links to videos such as the one below. Once a link is obtained, I try to download the video in this manner: import urllib.request import os url = 'http://www.videodetective.net/flash/players/?customerid=300120&playerid=351&publishedid=319113&playlistid=0&videokbrate=750&sub=RTO&pversion=5.2%22%20width=%22670%22%20height=%22360%22' response = urllib.request.urlopen(url).read() outpath = os.path.join(os.getcwd(), 'video.mp4') videofile = open(outpath , 'wb') videofile.write

Is there a library similar to lxml or nokogiri for Java? [closed]

谁说胖子不能爱 submitted on 2019-12-03 06:22:32
Question: Closed. This question is off-topic. It is not currently accepting answers. Want to improve this question? Update the question so it's on-topic for Stack Overflow. Closed 5 years ago. I want to do some screen scraping, ideally using CSS selectors and not XPath. Is there a library similar to the ones in Ruby or Python? Answer 1: There are dozens of screen-scraping libraries written in Java. Just to cite a few: TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid