mechanize

Mechanize Javascript

丶灬走出姿态 submitted on 2019-12-04 17:59:34

I'm trying to submit a form with Mechanize, but I'm not sure how to add the necessary form variables, which are normally set by some JavaScript. Since Mechanize does not support JavaScript yet, I'm trying to add the variables manually. The form source:

    <form name="aspnetForm" method="post" action="list.aspx" language="javascript" onkeypress="javascript:return WebForm_FireDefaultButton(event, '_ctl0_ContentPlaceHolder1_cmdSearch')" id="aspnetForm">
    <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
    <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
    <input …
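Whichever Mechanize (Ruby or Python) is in play, the underlying technique is the same: fill in the hidden ASP.NET postback fields yourself and POST them. A standard-library-only Python sketch follows; the field names come from the form source above, but the URL and the `__EVENTTARGET` value are assumptions taken from the `onkeypress` handler, not confirmed facts.

```python
# Sketch: submitting an ASP.NET form without JavaScript by setting the
# hidden postback fields manually. URL and __EVENTTARGET value are
# assumptions for illustration.
from urllib import parse, request

form_fields = {
    # The control named in the onkeypress handler above (assumed target):
    "__EVENTTARGET": "_ctl0_ContentPlaceHolder1_cmdSearch",
    "__EVENTARGUMENT": "",
    # Real ASP.NET pages usually also require "__VIEWSTATE", copied
    # verbatim from the hidden input on the page you first GET.
}
body = parse.urlencode(form_fields).encode("ascii")
req = request.Request("http://example.com/list.aspx", data=body)  # hypothetical URL
# request.urlopen(req) would perform the POST; skipped here.
print(body.decode("ascii"))
# prints: __EVENTTARGET=_ctl0_ContentPlaceHolder1_cmdSearch&__EVENTARGUMENT=
```

In Mechanize itself the equivalent is selecting the form, making read-only fields writable, and assigning the hidden values before submitting.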

Scrapy or Selenium or Mechanize to scrape web data?

天大地大妈咪最大 submitted on 2019-12-04 17:46:11

Question: I want to scrape some data from a website. Basically, the website has a tabular display showing around 50 records. For more records, the user has to click a button, which makes an AJAX call to fetch and show the next 50 records. I have previous knowledge of Selenium WebDriver (Python) and can do this very quickly in Selenium. But Selenium is more of an automation-testing tool, and it is very slow. I did some R&D and found that I could also do the same thing using Scrapy or Mechanize. Should I …
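The reason Scrapy or Mechanize can replace Selenium here: the "next 50 records" button usually just fires an HTTP request to a JSON endpoint, which you can call directly without a browser. The sketch below uses a stubbed fetch function in place of a real request; the endpoint shape, parameter names, and record count are all made up for illustration (find the real request in your browser's network tab).

```python
# Paginate through an AJAX endpoint directly instead of clicking a
# button in a browser. fetch_page() is a stub standing in for
# urllib.request.urlopen(f"...?offset={offset}&limit={limit}").
import json

def fetch_page(offset, limit=50):
    total = 120  # pretend the site has 120 records
    records = [{"id": i} for i in range(offset, min(offset + limit, total))]
    return json.dumps({"records": records, "total": total})

records = []
offset = 0
while True:
    page = json.loads(fetch_page(offset))
    records.extend(page["records"])
    offset += 50
    if offset >= page["total"]:
        break

print(len(records))  # prints 120 -- every record, no browser involved
```

This is why the non-browser tools are so much faster: they skip rendering entirely and speak HTTP/JSON directly.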

What pure Python library should I use to scrape a website?

假装没事ソ submitted on 2019-12-04 17:06:44

I currently have some Ruby code used to scrape some websites. I was using Ruby because at the time I was using Ruby on Rails for a site, and it just made sense. Now I'm trying to port this over to Google App Engine and keep getting stuck. I've ported Python Mechanize to work with Google App Engine, but it doesn't support DOM inspection with XPath. I've tried the built-in ElementTree, but it choked on the first HTML blob I gave it when it ran into '&mdash'. Do I keep trying to hack ElementTree in there, or do I try to use something else? Thanks, Mark

Answer 1: Beautiful Soup.

Answer 2: lxml -- 100x better than …
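Beyond Beautiful Soup and lxml, note that the standard library's html.parser is also far more forgiving than ElementTree: ElementTree's XML parser rejects HTML entities it doesn't know, while html.parser tolerates them (even a bare "&mdash" with no semicolon). A minimal pure-Python sketch, using a hypothetical TextCollector class:

```python
# html.parser accepts HTML that makes ElementTree's XML parser choke,
# including unterminated entities like "&mdash".
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # convert_charrefs (on by default) decodes well-formed entities
        # like "&mdash;" before this is called.
        self.chunks.append(data)

p = TextCollector()
p.feed("<p>one &mdash two &mdash; three</p>")  # first entity has no semicolon
print("".join(p.chunks))  # no exception, unlike ElementTree
```

This won't give you XPath (that's lxml's territory), but it is pure Python and runs on App Engine without any porting.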

How to Get the Page Source with Mechanize/Nokogiri

心不动则不痛 submitted on 2019-12-04 16:30:57

Question: I'm logged into a webpage/servlet using Mechanize. I have a page object:

    jobShortListPg = agent.get(addressOfPage)

When I use the following:

    puts jobShortListPg

I get the "mechanized" version of the page, which I don't want, e.g.:

    #<Mechanize::Page::Link "Home" "blahICScriptProgramName=WEBLIB_MENU.ISCRIPT3.FieldFormula.IScript_DrillDown&target=main0&Level=0&RL=&navc=3171">

How do I get the HTML source of the page instead?

Answer 1: Use .body:

    puts jobShortListPg.body

Answer 2: Use the content method of the …

Ruby/Mechanize “failed to allocate memory”. Erasing instantiation of 'agent.get' method?

 ̄綄美尐妖づ submitted on 2019-12-04 16:06:32
I've got a little problem with leaking memory in a Mechanize Ruby script. I "while loop" over multiple web pages forever, and memory grows a lot on each iteration. That caused a "failed to allocate memory" error after a few minutes and made the script exit. In fact, it seems that the agent.get method instantiates and holds the result even if I assign the result to the same local variable, or even a global variable. So I tried assigning nil to the variable after its last use and before reusing the same variable name. But it seems that previous agent.get results are still in memory, and I really don't know how to …
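A likely cause, not visible in the truncated excerpt: Ruby Mechanize keeps every fetched page in its history, so an endless loop of agent.get calls grows without bound no matter what you do to your own variables. The commonly cited fix is to cap that history (e.g. `agent.history.max_size = 0`). The bounded-history idea itself, sketched in Python with a deque whose maxlen evicts old entries automatically:

```python
# Bounded history: old pages are evicted instead of accumulating.
# get() is a stub standing in for a real HTTP fetch; the URLs are
# hypothetical.
from collections import deque

history = deque(maxlen=2)  # keep at most the last 2 "pages"

def get(url):
    page = "fake page body for " + url
    history.append(page)   # when full, the oldest entry is dropped
    return page

for i in range(1000):
    get(f"http://example.com/{i}")

print(len(history))  # prints 2 -- never grows, no matter how long the loop runs
```

Assigning nil in Ruby can't help while Mechanize's own history still holds a reference; the garbage collector only frees objects nothing points to.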

Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

扶醉桌前 submitted on 2019-12-04 15:03:34

I am attempting to write a program that, as an example, will scrape the top price off of this web page: http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults First, I am easily able to retrieve the HTML by doing the following:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    import mechanize

    webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'
    br = mechanize.Browser()
    data = br.open(webpage).get_data()
    soup = BeautifulSoup(data)
    print soup

However, the raw HTML does not contain the price. The browser does... its thing (clarification …
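The price is missing from the raw HTML because JavaScript fills it in after page load, typically from a separate XHR that returns JSON. Rather than parsing the empty HTML shell, you can find that request in the browser's network tab and fetch it directly. Everything in the sketch below (the endpoint stub, field names, and prices) is a made-up stand-in for illustration, not Kayak's actual API:

```python
# Scrape the data source behind the page, not the rendered page.
# fetch_results_xhr() stubs out a call like
# urllib.request.urlopen("https://.../results.json").
import json

def fetch_results_xhr():
    return json.dumps({"flights": [{"price": 612}, {"price": 587}, {"price": 640}]})

data = json.loads(fetch_results_xhr())
top_price = min(f["price"] for f in data["flights"])
print(top_price)  # prints 587, the cheapest fare in the stub data
```

If no clean JSON endpoint exists, the fallback is a real browser engine (Selenium or a headless browser) that executes the JavaScript for you.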

Ruby Mechanize, Nokogiri and Net::HTTP

南笙酒味 submitted on 2019-12-04 14:58:02

I am using Net::HTTP for HTTP requests and getting a response back:

    uri = URI("http://www.example.com")
    http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
    request = Net::HTTP::Get.new uri.request_uri
    response = http.request request # Net::HTTPResponse object
    body = response.body

If I have to use the Nokogiri gem in order to parse this HTML response, I will do:

    nokogiri_obj = Nokogiri::HTML(body)

But if I want to use the Mechanize gem, I need to do this:

    agent = Mechanize.new
    mechanize_obj = agent.get("http://www.example.com")

Is it possible for me to use Net::HTTP for getting the …

Javascript (and HTML rendering) engine without a GUI for automation?

最后都变了- submitted on 2019-12-04 13:57:52

Question: Are there any libraries or frameworks that provide the functionality of a browser but do not need to actually render physically onto the screen? I want to automate navigation on web pages (Mechanize does this, for example), but I want the full browser experience, including JavaScript. Thus, I'd like a virtual browser of some sort that I can use to "click on links" programmatically, have DOM elements and JS scripts render within it, and manipulate these elements. Solution preferably …

I can't remove whitespaces from a string parsed by Nokogiri

心已入冬 submitted on 2019-12-04 13:13:26

Question: I can't remove whitespace from a string parsed by Nokogiri. My HTML is:

    <p class='your-price'>
    Cena pro Vás: <strong>139 <small>Kč</small></strong>
    </p>

My code is:

    #encoding: utf-8
    require 'rubygems'
    require 'mechanize'

    agent = Mechanize.new
    site = agent.get("http://www.astratex.cz/podlozky-pod-raminka/doplnky")
    price = site.search("//p[@class='your-price']/strong/text()")
    val = price.first.text
    # => "139 "
    val.strip          # => "139 "
    val.gsub(" ", "")  # => "139 "

gsub, strip, etc. don't work. Why, and how do I fix …
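The invisible character here is almost certainly a non-breaking space (U+00A0), which looks like a space but is a different character, so neither `strip` nor `gsub(" ", "")` matches it. In Ruby the fix is to target it explicitly, e.g. `val.gsub(/\u00A0/, "")`. A quick Python demonstration of the same diagnosis (note that Python 3's str.strip, unlike Ruby's strip, does treat U+00A0 as whitespace):

```python
# "139" followed by a non-breaking space, as Nokogiri likely returned it.
nbsp_val = "139\u00a0"

print(repr(nbsp_val.replace(" ", "")))       # ASCII space doesn't match: '139\xa0'
print(repr(nbsp_val.replace("\u00a0", "")))  # targeting U+00A0 works: '139'
print(repr(nbsp_val.strip()))                # Python 3 strips NBSP too: '139'
```

When a "space" refuses to go away, print `repr()` of the string (or `val.bytes` in Ruby) to see which character it really is.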

UnicodeDecodeError problem with mechanize [duplicate]

别等时光非礼了梦想. submitted on 2019-12-04 13:04:25

This question already has answers here: How to determine the encoding of text? (9 answers). Closed 2 years ago.

I receive the following string from one website via mechanize: 'We\x92ve'. I know that \x92 stands for the ’ character. I'm trying to convert that string to Unicode:

    >>> unicode('We\x92ve','utf-8')
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2: unexpected code byte

What am I doing wrong?

Edit: The reason I tried 'utf-8' was this:

    >>> response = browser.response()
    >>> response.info()['content-type']
    'text/html; charset=utf-8'

Now I see I can't always trust the content …
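The byte 0x92 is the Windows-1252 (cp1252) encoding of the right single quotation mark (U+2019); it is not valid UTF-8, which is exactly why the utf-8 codec rejects it even though the server's Content-Type header claims charset=utf-8. Decoding with cp1252 succeeds (Python 3 shown here; in Python 2 the equivalent is unicode('We\x92ve', 'cp1252')):

```python
# The page lied about its charset: the bytes are really Windows-1252.
raw = b"We\x92ve"
text = raw.decode("cp1252")
print(text)  # prints: We've with a curly apostrophe (U+2019)
```

When headers can't be trusted, either try cp1252 as a fallback for stray high bytes, decode with errors="replace" to avoid crashing, or use a charset-detection library to guess the real encoding from the bytes themselves.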