screen-scraping

How can I screen scrape with Perl?

Submitted by 别来无恙 on 2019-12-17 22:25:20

Question: I need to display some values that are stored on a website; to do that I need to scrape the site and fetch the content from a table. Any ideas?

Answer 1: If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:

```perl
## print every <h2> tag in page
use pQuery;
pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });
```

There's also HTML::DOM. Whatever you do, though, don't use regular expressions to parse HTML.
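The answer above is Perl-specific, but the underlying task (pulling table-cell text out of fetched HTML without resorting to regular expressions) can be sketched with Python's standard library as well; the sample table below is made up for illustration:

```python
# Hedged sketch: collect the text of every <td> cell using only the stdlib.
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collects the text content of every <td> cell it encounters."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = TableTextParser()
parser.feed("<table><tr><td>alpha</td><td>beta</td></tr></table>")
print(parser.cells)  # -> ['alpha', 'beta']
```

For a live page, pass the body from `urllib.request.urlopen(url).read().decode()` to `feed()` instead of the literal string.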

Screen scraping: getting around “HTTP Error 403: request disallowed by robots.txt”

Submitted by 萝らか妹 on 2019-12-17 21:44:11

Question: Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth. I'm using mechanize and BeautifulSoup on Python 2.6, hoping for a work-around.

Answer 1: You can try lying about your user agent (e.g., by trying to make believe you're a human being and not a robot).
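A minimal sketch of that suggestion using only the standard library; the UA string is an arbitrary browser-like example, not from the original answer. With mechanize the same idea is `br.addheaders = [('User-Agent', ua)]`, and `br.set_handle_robots(False)` disables the robots.txt check entirely; note that ignoring robots.txt may breach a site's terms of use.

```python
# Hedged sketch: send a browser-like User-Agent instead of the library default.
import urllib.request

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36")
req = urllib.request.Request("http://example.com/", headers={"User-Agent": ua})
# urllib normalizes header names to "User-agent" internally:
print(req.get_header("User-agent"))
```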

Python Scraping JavaScript using Selenium and Beautiful Soup

Submitted by 时光怂恿深爱的人放手 on 2019-12-17 19:22:29

Question: I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium. I have the following code so far, but it still somehow doesn't detect the JavaScript (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom (Inspect Element shows the class as postText). Thanks for the help!

```python
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup

browser =
```
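A common cause of the null result is parsing the raw HTML instead of the rendered DOM: after the browser has loaded the page, hand `driver.page_source` (the post-JavaScript markup) to the parser rather than anything fetched with urllib. The rendering step needs a real browser, but the parsing half can be sketched stdlib-only; `postText` is the class named in the question, and the HTML fragment is invented:

```python
# Hedged sketch: extract the text inside elements carrying a given class.
from html.parser import HTMLParser

class ClassTextParser(HTMLParser):
    """Collects text inside any element (and its children) with class `cls`."""
    def __init__(self, cls):
        super().__init__()
        self.cls, self.depth, self.texts = cls, 0, []
    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.cls in classes:
            self.depth += 1          # inside a matching subtree
    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())

p = ClassTextParser("postText")
p.feed('<div class="postText">Nice post!</div>')
print(p.texts)  # -> ['Nice post!']
```

In the Selenium version, the equivalent one-liner is `driver.find_elements` by class name, once the comments widget has finished loading.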

Scraping javascript website in R

Submitted by 我的梦境 on 2019-12-17 18:00:26

Question: I want to scrape the match time and date from this URL: http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary

Using the Chrome dev tools, I can see the value appears to be generated by the following markup:

```html
<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>
```

But this is not in the source HTML. I think this is because it's JavaScript (correct me if I'm wrong). How can I scrape this information using R?

Answer 1: So, RSelenium is not the only answer (anymore).

Webbrowser behaviour issues

Submitted by 做~自己de王妃 on 2019-12-17 16:53:46

Question: I am trying to automate the WebBrowser control with .NET C#. The issue is that the control (or, should I say, the IE browser) behaves strangely on different computers. For example, I click on a link and fill out an Ajax popup form on the first computer like this, without any error:

```csharp
private void btn_Start_Click(object sender, RoutedEventArgs e)
{
    webbrowserIE.Navigate("http://www.test.com/");
    webbrowserIE.DocumentCompleted += fillup_LoadCompleted;
}

void fillup_LoadCompleted(object sender, System.Windows.Forms
```

What's the best approach for parsing XML/'screen scraping' in iOS? UIWebview or NSXMLParser?

Submitted by 吃可爱长大的小学妹 on 2019-12-17 16:30:10

Question: I am creating an iOS app that needs to get some data from a web page. My first thought was to use NSXMLParser initWithContentsOfURL: and parse the HTML with the NSXMLParser delegate. However, this approach seems like it could quickly become painful (if, for example, the HTML changed, I would have to rewrite the parsing code, which could be awkward). Seeing as I'm loading a web page, I took a look at UIWebView too. It looks like UIWebView may be the way to go.

Scrapy Python Set up User Agent

Submitted by 被刻印的时光 ゝ on 2019-12-17 15:53:42

Question: I tried to override the user agent of my CrawlSpider by adding an extra line to the project configuration file. Here is the code:

```ini
[settings]
default = myproject.settings
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

[deploy]
#url = http://localhost:6800/
project = myproject
```

But when I run the crawler against my own site, I notice that the spider did not pick up my customized user agent but the default one, "Scrapy/0.18.2 (
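The usual fix, sketched here assuming the standard Scrapy project layout: the `[settings]` block in scrapy.cfg only tells the Scrapy tools where the settings module lives; it is not read for setting values. `USER_AGENT` must go in `myproject/settings.py` itself:

```python
# myproject/settings.py -- Scrapy reads USER_AGENT from the settings module,
# not from scrapy.cfg (whose [settings] section merely points here).
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36")
```

Newer Scrapy versions also allow a per-spider override via a `custom_settings` dict on the spider class.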

Programmatic Python Browser with JavaScript

Submitted by 空扰寡人 on 2019-12-17 15:35:48

Question: I want to screen-scrape a website that uses JavaScript. There is mechanize, the programmatic web browser for Python; however, it (understandably) doesn't interpret JavaScript. Is there any programmatic browser for Python that does? If not, is there any JavaScript implementation in Python that I could use to attempt to create one?

Answer 1: You might be better off using a tool like Selenium to automate the scraping with a real web browser, so the JS executes and the page renders just like it would for a human visitor.
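A sketch of that Selenium route (the function name and browser choice are illustrative, not from the original answer); it assumes the selenium package is installed and a matching driver, e.g. geckodriver, is on PATH:

```python
def fetch_rendered_html(url):
    """Return the page source after the browser has executed its JavaScript."""
    from selenium import webdriver  # deferred import: selenium is an optional dependency
    driver = webdriver.Firefox()    # any Selenium-supported browser works
    try:
        driver.get(url)
        return driver.page_source   # the DOM *after* scripts have run
    finally:
        driver.quit()
```

The returned string can then be handed to any HTML parser, exactly as if the site had served static markup.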

How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

Submitted by 孤街浪徒 on 2019-12-17 10:49:50

Question: There is a website I am trying to pull information from in Perl; however, the section of the page I need is generated with JavaScript, so all you see in the source is:

```html
<div id="results"></div>
```

I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. E.g., the information I want to save would be document.getElementById('results').innerHTML;. I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx
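The question asks for Perl, where a JavaScript-capable driver such as WWW::Mechanize::Firefox fills the same role; as a hedged illustration in Python with Selenium, the exact getElementById expression from the question can be evaluated inside the live page:

```python
def grab_results_div(driver):
    """Return the innerHTML of <div id="results"> after the page's JS has run.

    `driver` is assumed to be a live Selenium WebDriver that has already
    loaded the page (e.g. via driver.get(url)).
    """
    return driver.execute_script(
        "return document.getElementById('results').innerHTML;")
```

Writing the returned string to a file is then an ordinary `open(path, "w").write(...)`.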

Click on a javascript link within python?

Submitted by 与世无争的帅哥 on 2019-12-17 10:31:41

Question: I am navigating a site using Python's mechanize module and am having trouble clicking a JavaScript link to get to the next page. I did a bit of reading, and people suggested I need python-spidermonkey and DOMForm. I managed to get them installed, but I am not sure of the syntax to actually click the link. I can identify the code on the page as:

```html
<a href="javascript:__doPostBack('ctl00$MainContent$gvSearchResults','Page$2')">2</a>
```

Does anyone know how to click on it? Or is there perhaps another tool?
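One hedged alternative to mechanize (which cannot execute the JavaScript): drive a real browser with Selenium and either click the link by its text or fire the page's own __doPostBack directly. The target and argument strings below are taken from the href in the question; the helper itself is illustrative:

```python
def go_to_page_two(driver):
    """Trigger the ASP.NET postback behind the '2' paging link.

    `driver` is assumed to be a live Selenium WebDriver with the page loaded.
    """
    driver.execute_script(
        "__doPostBack(arguments[0], arguments[1]);",
        "ctl00$MainContent$gvSearchResults",  # event target from the question's href
        "Page$2",                             # event argument: second results page
    )
```

The simpler click-based form would be `driver.find_element(By.LINK_TEXT, "2").click()`, which lets the page run the same postback itself.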