screen-scraping

How can I screen scrape with Perl?

Submitted by 别来无恙 on 2019-12-17 22:25:20

Question: I need to display some values that are stored on a website; to do that I need to scrape the site and fetch the content from a table. Any ideas?

Answer 1: If you are familiar with jQuery you might want to check out pQuery, which makes this very easy:

```perl
## print every <h2> tag in page
use pQuery;
pQuery("http://google.com/search?q=pquery")
    ->find("h2")
    ->each(sub {
        my $i = shift;
        print $i + 1, ") ", pQuery($_)->text, "\n";
    });
```

There's also HTML::DOM. Whatever you do, though, don't use regular expressions to parse HTML.
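The answer above is Perl-specific, but the underlying task (pulling table-cell text out of fetched HTML without resorting to regular expressions) can be sketched with Python's standard library as well; the sample table below is made up for illustration:

```python
# Hedged sketch: collect the text of every <td> cell using only the stdlib.
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collects the text content of every <td> cell it encounters."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = TableTextParser()
parser.feed("<table><tr><td>alpha</td><td>beta</td></tr></table>")
print(parser.cells)  # -> ['alpha', 'beta']
```

For a live page, pass the body from `urllib.request.urlopen(url).read().decode()` to `feed()` instead of the literal string.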

Screen scraping: getting around “HTTP Error 403: request disallowed by robots.txt”

Submitted by 萝らか妹 on 2019-12-17 21:44:11

Question: Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales; I'm not sure why they would deny access at a certain depth. I'm using mechanize and BeautifulSoup on Python 2.6, hoping for a work-around.

Answer 1: You can try lying about your user agent (e.g., by trying to make believe you're a human being and not a robot).
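A minimal sketch of that suggestion using only the standard library; the UA string is an arbitrary browser-like example, not from the original answer. With mechanize the same idea is `br.addheaders = [('User-Agent', ua)]`, and `br.set_handle_robots(False)` disables the robots.txt check entirely; note that ignoring robots.txt may breach a site's terms of use.

```python
# Hedged sketch: send a browser-like User-Agent instead of the library default.
import urllib.request

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36")
req = urllib.request.Request("http://example.com/", headers={"User-Agent": ua})
# urllib normalizes header names to "User-agent" internally:
print(req.get_header("User-agent"))
```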

Python Scraping JavaScript using Selenium and Beautiful Soup

Submitted by 时光怂恿深爱的人放手 on 2019-12-17 19:22:29

Question: I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium. I have the following code so far, but it still somehow doesn't detect the JavaScript (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom (Inspect Element shows the class as postText). Thanks for the help!

```python
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup

browser =
```
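A common cause of the null result is parsing the raw HTML instead of the rendered DOM: after the browser has loaded the page, hand `driver.page_source` (the post-JavaScript markup) to the parser rather than anything fetched with urllib. The rendering step needs a real browser, but the parsing half can be sketched stdlib-only; `postText` is the class named in the question, and the HTML fragment is invented:

```python
# Hedged sketch: extract the text inside elements carrying a given class.
from html.parser import HTMLParser

class ClassTextParser(HTMLParser):
    """Collects text inside any element (and its children) with class `cls`."""
    def __init__(self, cls):
        super().__init__()
        self.cls, self.depth, self.texts = cls, 0, []
    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self.depth or self.cls in classes:
            self.depth += 1          # inside a matching subtree
    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())

p = ClassTextParser("postText")
p.feed('<div class="postText">Nice post!</div>')
print(p.texts)  # -> ['Nice post!']
```

In the Selenium version, the equivalent one-liner is `driver.find_elements` by class name, once the comments widget has finished loading.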

Scraping javascript website in R

Submitted by 我的梦境 on 2019-12-17 18:00:26

Question: I want to scrape the match time and date from this URL: http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary

Using the Chrome dev tools, I can see the value appears to be generated by the following markup:

```html
<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>
```

But this is not in the source HTML. I think this is because it's JavaScript (correct me if I'm wrong). How can I scrape this information using R?

Answer 1: So, RSelenium is not the only answer (anymore).

Webbrowser behaviour issues

Submitted by 做~自己de王妃 on 2019-12-17 16:53:46

Question: I am trying to automate the WebBrowser control with .NET C#. The issue is that the control (or, should I say, the IE browser) behaves strangely on different computers. For example, I click on a link and fill out an Ajax popup form on the first computer like this, without any error:

```csharp
private void btn_Start_Click(object sender, RoutedEventArgs e)
{
    webbrowserIE.Navigate("http://www.test.com/");
    webbrowserIE.DocumentCompleted += fillup_LoadCompleted;
}

void fillup_LoadCompleted(object sender, System.Windows.Forms
```

What's the best approach for parsing XML/'screen scraping' in iOS? UIWebview or NSXMLParser?

Submitted by 吃可爱长大的小学妹 on 2019-12-17 16:30:10

Question: I am creating an iOS app that needs to get some data from a web page. My first thought was to use NSXMLParser initWithContentsOfURL: and parse the HTML with the NSXMLParser delegate. However, this approach seems like it could quickly become painful (if, for example, the HTML changed, I would have to rewrite the parsing code, which could be awkward). Seeing as I'm loading a web page, I took a look at UIWebView too. It looks like UIWebView may be the way to go.

Scrapy Python Set up User Agent

Submitted by 被刻印的时光 ゝ on 2019-12-17 15:53:42

Question: I tried to override the user agent of my CrawlSpider by adding an extra line to the project configuration file. Here is the code:

```ini
[settings]
default = myproject.settings
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"

[deploy]
#url = http://localhost:6800/
project = myproject
```

But when I run the crawler against my own site, I notice that the spider did not pick up my customized user agent but the default one, "Scrapy/0.18.2 (
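The usual fix, sketched here assuming the standard Scrapy project layout: the `[settings]` block in scrapy.cfg only tells the Scrapy tools where the settings module lives; it is not read for setting values. `USER_AGENT` must go in `myproject/settings.py` itself:

```python
# myproject/settings.py -- Scrapy reads USER_AGENT from the settings module,
# not from scrapy.cfg (whose [settings] section merely points here).
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36")
```

Newer Scrapy versions also allow a per-spider override via a `custom_settings` dict on the spider class.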

Programmatic Python Browser with JavaScript

Submitted by 空扰寡人 on 2019-12-17 15:35:48

Question: I want to screen-scrape a website that uses JavaScript. There is mechanize, the programmatic web browser for Python; however, it (understandably) doesn't interpret JavaScript. Is there any programmatic browser for Python that does? If not, is there any JavaScript implementation in Python that I could use to attempt to create one?

Answer 1: You might be better off using a tool like Selenium to automate the scraping with a real web browser, so the JS executes and the page renders just like it would for a human visitor.
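A sketch of that Selenium route (the function name and browser choice are illustrative, not from the original answer); it assumes the selenium package is installed and a matching driver, e.g. geckodriver, is on PATH:

```python
def fetch_rendered_html(url):
    """Return the page source after the browser has executed its JavaScript."""
    from selenium import webdriver  # deferred import: selenium is an optional dependency
    driver = webdriver.Firefox()    # any Selenium-supported browser works
    try:
        driver.get(url)
        return driver.page_source   # the DOM *after* scripts have run
    finally:
        driver.quit()
```

The returned string can then be handed to any HTML parser, exactly as if the site had served static markup.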

How can I use Perl to grab text from a web page that is dynamically generated with JavaScript?

Submitted by 孤街浪徒 on 2019-12-17 10:49:50

Question: There is a website I am trying to pull information from in Perl; however, the section of the page I need is generated with JavaScript, so all you see in the source is:

```html
<div id="results"></div>
```

I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. E.g., the information I want to save would be document.getElementById('results').innerHTML;. I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx
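The question asks for Perl, where a JavaScript-capable driver such as WWW::Mechanize::Firefox fills the same role; as a hedged illustration in Python with Selenium, the exact getElementById expression from the question can be evaluated inside the live page:

```python
def grab_results_div(driver):
    """Return the innerHTML of <div id="results"> after the page's JS has run.

    `driver` is assumed to be a live Selenium WebDriver that has already
    loaded the page (e.g. via driver.get(url)).
    """
    return driver.execute_script(
        "return document.getElementById('results').innerHTML;")
```

Writing the returned string to a file is then an ordinary `open(path, "w").write(...)`.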

Click on a javascript link within python?

Submitted by 与世无争的帅哥 on 2019-12-17 10:31:41

Question: I am navigating a site using Python's mechanize module and am having trouble clicking a JavaScript link to get to the next page. I did a bit of reading, and people suggested I need python-spidermonkey and DOMForm. I managed to get them installed, but I am not sure of the syntax to actually click the link. I can identify the code on the page as:

```html
<a href="javascript:__doPostBack('ctl00$MainContent$gvSearchResults','Page$2')">2</a>
```

Does anyone know how to click on it? Or is there perhaps another tool?
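One hedged alternative to mechanize (which cannot execute the JavaScript): drive a real browser with Selenium and either click the link by its text or fire the page's own __doPostBack directly. The target and argument strings below are taken from the href in the question; the helper itself is illustrative:

```python
def go_to_page_two(driver):
    """Trigger the ASP.NET postback behind the '2' paging link.

    `driver` is assumed to be a live Selenium WebDriver with the page loaded.
    """
    driver.execute_script(
        "__doPostBack(arguments[0], arguments[1]);",
        "ctl00$MainContent$gvSearchResults",  # event target from the question's href
        "Page$2",                             # event argument: second results page
    )
```

The simpler click-based form would be `driver.find_element(By.LINK_TEXT, "2").click()`, which lets the page run the same postback itself.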