mechanize

Mechanize Javascript

丶灬走出姿态 submitted on 2019-12-04 17:59:34

I'm trying to submit a form with Mechanize, but I'm not sure how to add the necessary form variables, which are normally set by some JavaScript. Since Mechanize does not support JavaScript yet, I'm trying to add the variables manually. The form source:

    <form name="aspnetForm" method="post" action="list.aspx" language="javascript" onkeypress="javascript:return WebForm_FireDefaultButton(event, '_ctl0_ContentPlaceHolder1_cmdSearch')" id="aspnetForm">
    <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
    <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
    <input …
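Whichever Mechanize (Ruby or Python) is in play, the underlying technique is the same: fill in the hidden ASP.NET postback fields yourself and POST them. A standard-library-only Python sketch follows; the field names come from the form source above, but the URL and the `__EVENTTARGET` value are assumptions taken from the `onkeypress` handler, not confirmed facts.

```python
# Sketch: submitting an ASP.NET form without JavaScript by setting the
# hidden postback fields manually. URL and __EVENTTARGET value are
# assumptions for illustration.
from urllib import parse, request

form_fields = {
    # The control named in the onkeypress handler above (assumed target):
    "__EVENTTARGET": "_ctl0_ContentPlaceHolder1_cmdSearch",
    "__EVENTARGUMENT": "",
    # Real ASP.NET pages usually also require "__VIEWSTATE", copied
    # verbatim from the hidden input on the page you first GET.
}
body = parse.urlencode(form_fields).encode("ascii")
req = request.Request("http://example.com/list.aspx", data=body)  # hypothetical URL
# request.urlopen(req) would perform the POST; skipped here.
print(body.decode("ascii"))
# prints: __EVENTTARGET=_ctl0_ContentPlaceHolder1_cmdSearch&__EVENTARGUMENT=
```

In Mechanize itself the equivalent is selecting the form, making read-only fields writable, and assigning the hidden values before submitting.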

Scrapy or Selenium or Mechanize to scrape web data?

天大地大妈咪最大 submitted on 2019-12-04 17:46:11

Question: I want to scrape some data from a website. Basically, the website has a tabular display showing around 50 records. For more records, the user has to click a button, which makes an AJAX call to fetch and show the next 50 records. I have previous knowledge of Selenium WebDriver (Python) and can do this very quickly in Selenium. But Selenium is more of an automation-testing tool, and it is very slow. I did some R&D and found that I could also do the same thing using Scrapy or Mechanize. Should I …
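The reason Scrapy or Mechanize can replace Selenium here: the "next 50 records" button usually just fires an HTTP request to a JSON endpoint, which you can call directly without a browser. The sketch below uses a stubbed fetch function in place of a real request; the endpoint shape, parameter names, and record count are all made up for illustration (find the real request in your browser's network tab).

```python
# Paginate through an AJAX endpoint directly instead of clicking a
# button in a browser. fetch_page() is a stub standing in for
# urllib.request.urlopen(f"...?offset={offset}&limit={limit}").
import json

def fetch_page(offset, limit=50):
    total = 120  # pretend the site has 120 records
    records = [{"id": i} for i in range(offset, min(offset + limit, total))]
    return json.dumps({"records": records, "total": total})

records = []
offset = 0
while True:
    page = json.loads(fetch_page(offset))
    records.extend(page["records"])
    offset += 50
    if offset >= page["total"]:
        break

print(len(records))  # prints 120 -- every record, no browser involved
```

This is why the non-browser tools are so much faster: they skip rendering entirely and speak HTTP/JSON directly.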

What pure Python library should I use to scrape a website?

假装没事ソ submitted on 2019-12-04 17:06:44

I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense. Now I'm trying to port this over to Google App Engine and keep getting stuck. I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath. I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash'. Do I keep trying to hack ElementTree in there, or do I try to use something else? Thanks, Mark

Answer 1: Beautiful Soup.

Answer 2: lxml -- 100x better than …
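Beyond Beautiful Soup and lxml, note that the standard library's html.parser is also far more forgiving than ElementTree: ElementTree's XML parser rejects HTML entities it doesn't know, while html.parser tolerates them (even a bare "&mdash" with no semicolon). A minimal pure-Python sketch, using a hypothetical TextCollector class:

```python
# html.parser accepts HTML that makes ElementTree's XML parser choke,
# including unterminated entities like "&mdash".
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # convert_charrefs (on by default) decodes well-formed entities
        # like "&mdash;" before this is called.
        self.chunks.append(data)

p = TextCollector()
p.feed("<p>one &mdash two &mdash; three</p>")  # first entity has no semicolon
print("".join(p.chunks))  # no exception, unlike ElementTree
```

This won't give you XPath (that's lxml's territory), but it is pure Python and runs on App Engine without any porting.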

How to Get the Page Source with Mechanize/Nokogiri

心不动则不痛 submitted on 2019-12-04 16:30:57

Question: I'm logged into a webpage/servlet using Mechanize. I have a page object:

    jobShortListPg = agent.get(addressOfPage)

When I use the following:

    puts jobShortListPg

I get the "mechanized" version of the page, which I don't want, e.g.:

    #<Mechanize::Page::Link "Home" "blahICScriptProgramName=WEBLIB_MENU.ISCRIPT3.FieldFormula.IScript_DrillDown&target=main0&Level=0&RL=&navc=3171">

How do I get the HTML source of the page instead?

Answer 1: Use .body:

    puts jobShortListPg.body

Answer 2: Use the content method of the …

Ruby/Mechanize “failed to allocate memory”. Erasing instantiation of 'agent.get' method?

 ̄綄美尐妖づ submitted on 2019-12-04 16:06:32
I've got a little problem with leaking memory in a Mechanize Ruby script. I "while loop" over multiple web pages forever, and memory grows a lot on each iteration. That caused a "failed to allocate memory" error after a few minutes and made the script exit. In fact, it seems that the agent.get method instantiates and holds the result even if I assign the result to the same local variable, or even a global variable. So I tried assigning nil to the variable after its last use and before reusing the same variable name. But it seems that previous agent.get results are still in memory, and I really don't know how to …
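A likely cause, not visible in the truncated excerpt: Ruby Mechanize keeps every fetched page in its history, so an endless loop of agent.get calls grows without bound no matter what you do to your own variables. The commonly cited fix is to cap that history (e.g. `agent.history.max_size = 0`). The bounded-history idea itself, sketched in Python with a deque whose maxlen evicts old entries automatically:

```python
# Bounded history: old pages are evicted instead of accumulating.
# get() is a stub standing in for a real HTTP fetch; the URLs are
# hypothetical.
from collections import deque

history = deque(maxlen=2)  # keep at most the last 2 "pages"

def get(url):
    page = "fake page body for " + url
    history.append(page)   # when full, the oldest entry is dropped
    return page

for i in range(1000):
    get(f"http://example.com/{i}")

print(len(history))  # prints 2 -- never grows, no matter how long the loop runs
```

Assigning nil in Ruby can't help while Mechanize's own history still holds a reference; the garbage collector only frees objects nothing points to.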

Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

扶醉桌前 submitted on 2019-12-04 15:03:34

I am attempting to write a program that, as an example, will scrape the top price off of this web page: http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults First, I am easily able to retrieve the HTML by doing the following:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    import mechanize

    webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
    br = mechanize.Browser()
    data = br.open(webpage).get_data()
    soup = BeautifulSoup(data)
    print soup

However, the raw HTML does not contain the price. The browser does... its thing (clarification …
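The price is missing from the raw HTML because JavaScript fills it in after page load, typically from a separate XHR that returns JSON. Rather than parsing the empty HTML shell, you can find that request in the browser's network tab and fetch it directly. Everything in the sketch below (the endpoint stub, field names, and prices) is a made-up stand-in for illustration, not Kayak's actual API:

```python
# Scrape the data source behind the page, not the rendered page.
# fetch_results_xhr() stubs out a call like
# urllib.request.urlopen("https://.../results.json").
import json

def fetch_results_xhr():
    return json.dumps({"flights": [{"price": 612}, {"price": 587}, {"price": 640}]})

data = json.loads(fetch_results_xhr())
top_price = min(f["price"] for f in data["flights"])
print(top_price)  # prints 587, the cheapest fare in the stub data
```

If no clean JSON endpoint exists, the fallback is a real browser engine (Selenium or a headless browser) that executes the JavaScript for you.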

Ruby Mechanize, Nokogiri and Net::HTTP

南笙酒味 submitted on 2019-12-04 14:58:02

I am using Net::HTTP for HTTP requests and getting a response back:

    uri = URI("http://www.example.com")
    http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
    request = Net::HTTP::Get.new uri.request_uri
    response = http.request request # Net::HTTPResponse object
    body = response.body

If I have to use the Nokogiri gem in order to parse this HTML response, I will do:

    nokogiri_obj = Nokogiri::HTML(body)

But if I want to use the Mechanize gem, I need to do this:

    agent = Mechanize.new
    mechanize_obj = agent.get("http://www.example.com")

Is it possible for me to use Net::HTTP for getting the …

Javascript (and HTML rendering) engine without a GUI for automation?

最后都变了- submitted on 2019-12-04 13:57:52

Question: Are there any libraries or frameworks that provide the functionality of a browser but do not need to actually render physically onto the screen? I want to automate navigation on web pages (Mechanize does this, for example), but I want the full browser experience, including JavaScript. Thus, I'd like a virtual browser of some sort that I can use to "click on links" programmatically, have DOM elements and JS scripts render within it, and manipulate these elements. Solution preferably …

I can't remove whitespaces from a string parsed by Nokogiri

心已入冬 submitted on 2019-12-04 13:13:26

Question: I can't remove whitespace from a string parsed by Nokogiri. My HTML is:

    <p class='your-price'>
    Cena pro Vás: <strong>139 <small>Kč</small></strong>
    </p>

My code is:

    #encoding: utf-8
    require 'rubygems'
    require 'mechanize'

    agent = Mechanize.new
    site = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
    price = site.search("//p[@class='your-price']/strong/text()")
    val = price.first.text
    # => "139 "
    val.strip          # => "139 "
    val.gsub(" ", "")  # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix …
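The invisible character here is almost certainly a non-breaking space (U+00A0), which looks like a space but is a different character, so neither `strip` nor `gsub(" ", "")` matches it. In Ruby the fix is to target it explicitly, e.g. `val.gsub(/\u00A0/, "")`. A quick Python demonstration of the same diagnosis (note that Python 3's str.strip, unlike Ruby's strip, does treat U+00A0 as whitespace):

```python
# "139" followed by a non-breaking space, as Nokogiri likely returned it.
nbsp_val = "139\u00a0"

print(repr(nbsp_val.replace(" ", "")))       # ASCII space doesn't match: '139\xa0'
print(repr(nbsp_val.replace("\u00a0", "")))  # targeting U+00A0 works: '139'
print(repr(nbsp_val.strip()))                # Python 3 strips NBSP too: '139'
```

When a "space" refuses to go away, print `repr()` of the string (or `val.bytes` in Ruby) to see which character it really is.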

UnicodeDecodeError problem with mechanize [duplicate]

别等时光非礼了梦想. submitted on 2019-12-04 13:04:25

This question already has answers here: How to determine the encoding of text? (9 answers). Closed 2 years ago.

I receive the following string from one website via mechanize: 'We\x92ve'. I know that \x92 stands for the ’ character. I'm trying to convert that string to Unicode:

    >>> unicode('We\x92ve','utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2: unexpected code byte

What am I doing wrong?

Edit: The reason I tried 'utf-8' was this:

    >>> response = browser.response()
    >>> response.info()['content-type']
    'text/html; charset=utf-8'

Now I see I can't always trust the content …
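The byte 0x92 is the Windows-1252 (cp1252) encoding of the right single quotation mark (U+2019); it is not valid UTF-8, which is exactly why the utf-8 codec rejects it even though the server's Content-Type header claims charset=utf-8. Decoding with cp1252 succeeds (Python 3 shown here; in Python 2 the equivalent is unicode('We\x92ve', 'cp1252')):

```python
# The page lied about its charset: the bytes are really Windows-1252.
raw = b"We\x92ve"
text = raw.decode("cp1252")
print(text)  # prints: We've with a curly apostrophe (U+2019)
```

When headers can't be trusted, either try cp1252 as a fallback for stray high bytes, decode with errors="replace" to avoid crashing, or use a charset-detection library to guess the real encoding from the bytes themselves.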