screen-scraping

Iconv::IllegalSequence when using WWW::Mechanize

ε祈祈猫儿з submitted on 2019-12-04 15:58:59
I'm trying to do a little bit of web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes. The POST request results in a 302 redirect (which Mechanize follows, so far so good) and the resulting page seems to crash it. I googled quite a bit, but so far nothing has turned up on how to solve this. Any of you got an idea? Code:

    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
                        {"Country" => "Deutschland", "Abholstation" => "Aalen",
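
For comparison, here is a hedged sketch of the same POST-and-follow-redirect flow in Python with requests (a different stack from the Ruby code above, and only the two form fields shown are known). Decoding the body yourself, with undecodable bytes replaced, avoids the kind of hard crash Iconv raises on an illegal byte sequence.

    # A minimal sketch, assuming the same two form fields as above; Python's
    # requests library stands in for the Ruby WWW::Mechanize stack.
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0 (Macintosh) Safari"  # rough alias
    resp = session.post(
        "https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung",
        data={"Country": "Deutschland", "Abholstation": "Aalen"},
        allow_redirects=True,  # follow the 302, as Mechanize does
    )
    # Decode defensively: replace bad bytes instead of raising, unlike Iconv.
    html = resp.content.decode(resp.encoding or "utf-8", errors="replace")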

Can scraping be applied to this page which is actively recalculating?

南笙酒味 submitted on 2019-12-04 15:28:17
I would like to grab satellite positions from the page(s) below, but I'm not sure if scraping is appropriate, because the page appears to be updating itself every second using some internal code (it keeps updating after I disconnect from the internet). Background information can be found in my question at Space Stackexchange: A nicer way to download the positions of the Orbcomm-2 satellites. I need a "snapshot" of four items simultaneously: UTC time, latitude, longitude, and altitude. Right now I use screenshots and manual typing. Since these values are being updated by the page - is conventional
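
A page that keeps updating after the network is unplugged is recomputing positions in client-side JavaScript, so the served HTML may never contain fresh values. One hedged approach, assuming the browser's dev tools reveal a JSON endpoint the page polls (the URL and field names below are placeholders):

    # Hypothetical sketch: poll the data endpoint the page itself uses,
    # taking all four values from one response so they stay in sync.
    import time
    import requests

    URL = "https://example.com/api/positions"  # placeholder; find via dev tools

    for _ in range(5):
        data = requests.get(URL, timeout=10).json()
        print(data["utc"], data["lat"], data["lon"], data["alt"])  # assumed keys
        time.sleep(1)  # the page refreshes roughly once per second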

Scrape and generate RSS feed

别等时光非礼了梦想. submitted on 2019-12-04 15:12:37
I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class. This is what I have now:

    <?php
    // This is a minimal example of using the class
    include("FeedWriter.php");
    include('simple_html_dom.php');

    $html = file_get_html('http://www.website.com');
    foreach($html->find('td[width="380"] p table') as $article) {
        $item['title'] = $article->find('span.title', 0)->innertext;
        $item['description'] = $article->find('.ingress', 0)->innertext;
        $item['link'] = $article->find('.lesMer', 0)->href;
        $item['pubDate'] = $article->find('span.presseDato', 0)-
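
For reference, the same scrape-then-feed idea sketched in Python (the language used for this digest's added examples): BeautifulSoup stands in for Simple HTML DOM, and the RSS XML is assembled by hand rather than with a FeedWriter class. The selectors are copied from the PHP above and remain assumptions about the target markup.

    # A rough sketch of the same idea: scrape items, emit minimal RSS 2.0.
    import requests
    from bs4 import BeautifulSoup
    from xml.sax.saxutils import escape

    soup = BeautifulSoup(requests.get("http://www.website.com").text, "html.parser")
    items = []
    for article in soup.select('td[width="380"] p table'):
        title = article.select_one("span.title")
        link = article.select_one(".lesMer")
        if title and link:
            items.append(
                "<item><title>%s</title><link>%s</link></item>"
                % (escape(title.get_text()), escape(link.get("href", "")))
            )
    rss = "<rss version='2.0'><channel>%s</channel></rss>" % "".join(items)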

Raw HTML vs. DOM scraping in Python using mechanize and Beautiful Soup

扶醉桌前 submitted on 2019-12-04 15:03:34
I am attempting to write a program that, as an example, will scrape the top price off of this web page: http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults

First, I am easily able to retrieve the HTML by doing the following:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    import mechanize

    webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
    br = mechanize.Browser()
    data = br.open(webpage).get_data()
    soup = BeautifulSoup(data)
    print soup

However, the raw HTML does not contain the price. The browser does... its thing (clarification
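
The prices on a page like this are filled in by JavaScript after the initial HTML loads, which is why neither urllib nor mechanize ever sees them. A minimal sketch of the usual workaround, driving a real browser so the scripts run before the DOM is read (Selenium here, as one option outside the question's stack):

    # Render the page in a real browser engine, then parse the resulting DOM.
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()
    driver.get("http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults")
    time.sleep(15)  # crude wait so the scripts have time to inject the prices
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()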

.NET screen scraping and sessions

自作多情 submitted on 2019-12-04 14:51:32
Question: I am trying to screen scrape using C#. It works a few times, after which I receive a "Session expired" error. Any help will be appreciated.

Answer 1: Here is the set of classes I am using for screen scraping. (I wrote these classes; feel free to use them however you want.) There may be some bugs in them, but in every use I have for them they work quite flawlessly. They also handle SSL websites fine, work with redirects, and capture the original pages that caused a redirect as well in the WebPage class. using
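
The usual cause of a "Session expired" after a few calls is that each request starts a fresh HTTP session instead of carrying the server's session cookie forward. The same fix expressed in Python (this digest's example language; the login URL and form fields are placeholders): one shared session object that stores and resends cookies automatically.

    # Hedged sketch: reuse one Session so the server-issued session cookie
    # persists across requests instead of expiring on every call.
    import requests

    session = requests.Session()
    session.post(
        "https://example.com/login",                # placeholder login endpoint
        data={"user": "me", "password": "secret"},  # placeholder credentials
    )
    page = session.get("https://example.com/data")  # cookie sent automatically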

HTML Parsing - Get data from a table inside a div?

我只是一个虾纸丫 submitted on 2019-12-04 14:01:25
Question: I am relatively new to the whole idea of HTML parsing/scraping. I was hoping that I could come here to get the help that I need! Basically, what I am looking to do (I think) is specify the URL of the page I wish to grab the data from. In this case: http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/

From there, I want to grab the table class="listing" in the div id="snapshot_table". I then wish to embed that table onto my own page and have it update when the original content is updated. I have
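
A minimal sketch of the first half of that task in Python with BeautifulSoup (the question names no language, so this is one option): a CSS selector pulls exactly the table described, and re-running the script on a schedule would keep an embedded copy current.

    # Grab the listing table inside the snapshot_table div.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/").text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("div#snapshot_table table.listing")
    if table is not None:
        print(table)  # the raw <table> markup, ready to embed or cache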

Scraping data from a secure website or automating mundane task

杀马特。学长 韩版系。学妹 submitted on 2019-12-04 13:32:38
I have a website where I need to log in with a username, password, and captcha. Once in, I have a control panel that has bookings. For each booking there is a link to a details page that has the email address of the person making the booking. Each day I need a list of all these email addresses so I can send an email to them. I know how to scrape sites in .NET to get these types of details, but not for websites where I need to be logged in. I saw an article saying I could pass the cookie as a header and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it
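
The cookie-as-header approach the article describes looks roughly like this in Python (a sketch only; the cookie name, value, and URLs are placeholders). The captcha is solved once by hand in the browser, and the scraper then rides that already-authenticated session until the cookie expires.

    # Reuse a session cookie copied from the browser's developer tools.
    import requests

    session = requests.Session()
    session.headers["Cookie"] = "ASP.NET_SessionId=paste-from-browser"  # placeholder
    panel = session.get("https://example.com/control-panel/bookings")
    # From here, follow each booking's details link and extract the email.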

Page scraping to get prices from Google Finance

元气小坏坏 submitted on 2019-12-04 13:25:51
Question: I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib package and then a regex to extract the price data. When I leave my Python script running, it works initially for some time (a few minutes) and then starts throwing the exception [HTTP Error 503: Service Unavailable]. I guess this is happening because the web server detects the frequent page requests as coming from a robot and throws this exception after a while. Is there a way around this, i.e. deleting
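
One common mitigation, sketched below under the assumption that the 503 is rate-limiting rather than a real outage: space the requests out and back off exponentially when the server pushes back. (Python 3 urllib shown, though the question's code is Python 2.)

    # Back off on HTTP 503 instead of hammering the server in a tight loop.
    import time
    import urllib.error
    import urllib.request

    def fetch(url, retries=5, base_delay=2.0):
        for attempt in range(retries):
            try:
                return urllib.request.urlopen(url).read()
            except urllib.error.HTTPError as e:
                if e.code != 503:
                    raise
                time.sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...
        raise RuntimeError("still throttled after %d retries" % retries)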

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

随声附和 submitted on 2019-12-04 12:37:36
I am using BeautifulSoup 3.1.0.1 with Python 2.5.2 and trying to parse a web page in French. However, as soon as I call findAll, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128). Below is the code I am currently running:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen("http://fr.encarta.msn.com/encyclopedia_761561798/Paris.html")
    soup = BeautifulSoup(page, fromEncoding="latin1")
    r = soup.findAll("table")
    print r

Does anybody have an idea why? Thanks!

UPDATE: As requested, below is
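
A hedged note on the error itself: a UnicodeEncodeError on é usually means some code path is converting a unicode result to a byte string with Python 2's default ASCII codec. Encoding explicitly is the standard workaround for the printing side (BeautifulSoup 3.1 also had known parser regressions, and downgrading to the 3.0.x series was another commonly suggested fix):

    # Python 2 idiom, matching the question's interpreter: encode each
    # result to UTF-8 explicitly instead of relying on the ASCII default.
    for table in r:
        print(unicode(table).encode("utf-8"))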

Scrape using multiple POST data from the same URL

对着背影说爱祢 submitted on 2019-12-04 11:45:34
I have already created one spider that collects a list of company names with matching phone numbers. This is then saved to a CSV file. I now want to scrape data from another site, using the phone numbers in the CSV file as POST data. I want it to loop through the same start URL, scraping the data that each phone number produces until there are no more numbers left in the CSV file. This is what I have got so far:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.http import FormRequest
    from scrapy.selector import HtmlXPathSelector
    from scrapy
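
A hedged sketch of the looping part, written against the modern Scrapy API rather than the deprecated BaseSpider imports above (the CSV filename, column name, URL, and form field are all placeholders): start_requests yields one FormRequest per row, so the spider naturally stops when the CSV runs out.

    # One POST to the same URL per phone number read from the CSV.
    import csv
    import scrapy
    from scrapy.http import FormRequest

    class PhoneLookupSpider(scrapy.Spider):
        name = "phone_lookup"

        def start_requests(self):
            with open("companies.csv") as f:
                for row in csv.DictReader(f):
                    yield FormRequest(
                        "https://example.com/search",      # same start URL each time
                        formdata={"phone": row["phone"]},  # assumed column name
                        callback=self.parse_result,
                        cb_kwargs={"phone": row["phone"]},
                    )

        def parse_result(self, response, phone):
            yield {"phone": phone, "status": response.status}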