mechanize | 易学教程

How do I integrate these two conditions block codes to mine in Ruby?


Question: How do I integrate these two conditions if my code scrapes without them? My code already works, but it scrapes all rows (non-bold and bold values) and doesn't scrape the title attribute string.

Condition 1: parse a table row only if one of its fields is bold:

    doc = Nokogiri::HTML(html)
    doc.xpath('//table[@class="articulos"]/tr[td[5]/p/b]').each do |row|
      puts row.at_xpath('td[3]/text()')
    end

Condition 2: get only the number from the title attribute string:

    doc = Nokogiri::HTML(html)

Find table in an array with the most rows using Ruby, Nokogiri and Mechanize


Question:

    @p = mechanize.get(url)
    tables = @p.search('table.someclass')

I'm going over about 200 pages and putting the tables in an array, and the only way to sort them is to find the table with the greatest number of rows. So I want to look at each item in the array and select the first item with the greatest number of rows. I've been trying to use max_by, but that won't work because I need to search the table that is the array item to find the tr.count.

Answer 1: Two ways: biggest =
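The selection described in the question (first table with the greatest row count) can be sketched in Python using only the standard library. This is not the truncated Ruby answer above. `RowCounter`, `count_rows`, and `table_with_most_rows` are hypothetical names, and each table is assumed to be available as an HTML string.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags seen in the markup fed to it."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "tr":
            self.rows += 1


def count_rows(table_html):
    """Return the number of <tr> tags in one table's HTML string."""
    parser = RowCounter()
    parser.feed(table_html)
    parser.close()
    return parser.rows


def table_with_most_rows(tables):
    """Return the first table whose row count is the maximum."""
    # max() keeps the first maximal item when several tables tie.
    return max(tables, key=count_rows)
```

Note that rows of nested tables are counted too, so a table that wraps other tables will report the combined total.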

Possible to use timeout in WWW::Mechanize on https?


Question: We have a Perl script that uses WWW::Mechanize to download content from a secured (https) website via our company proxy, using a POST action in WWW::Mechanize. Sometimes this POST action runs for hours for unknown reasons. I want to control this. I checked for a timeout option, but I also read in one of the posts on Stack Overflow that it does not work with https websites. Any idea how I can use the timeout mechanism? I want to stop processing that link say after a minute or so to proceed further and not

Data scraping multiple page clicks loops


Question: We're trying to figure out a way to use one Mechanize agent to scrape all of the data we want from the UCAS website and add it to arrays. Currently we're struggling with coding the link clicks in Mechanize. Can anyone help? There are three successive link clicks amid loops that progress through all search result pages. The first link, which displays all courses for a university, is within div class morecourseslink; the second link, which displays course names, duration and qual, is in div class

Python Mechanize Prevent Connection:Close


Question: I'm trying to use mechanize to get information from a web page. It succeeds in getting the first bit of information, but the web page includes a "Next" button to get more information, and I can't figure out how to get the additional information programmatically. Using Live HTTP Headers, I can see the HTTP request that is generated when I click the Next button in a browser. It seems as if I can issue the same request using mechanize, but in the latter case, instead of

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)


Question: I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form (screenshot: http://dl.dropbox.com/u/2792776/screenshots/2010-04-08_1440.png). However, it turns out that this isn't even a real form. Instead, when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it. "Log in" button HTML: <button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log
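When a page builds its login form in JavaScript, one general approach is to send the POST yourself while a cookie jar is attached to the HTTP client, so that any session cookies in the response are kept for later requests. The following is a minimal sketch of that idea using Python's standard library rather than Mechanize. The URL and the field names `username` and `password` are placeholders, not the site's real endpoint or parameters, and the request is only built here, not sent.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores cookies in the returned jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(url, fields):
    """Build a form-encoded POST request for the given fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)


# Hypothetical endpoint and field names, for illustration only.
opener, jar = make_session()
request = build_login_request(
    "https://example.invalid/login",
    {"username": "alice", "password": "secret"},
)
# opener.open(request) would send it; cookies set by the response
# would then be stored in jar and sent on later opener.open() calls.
```

The real field names would have to be read from the JavaScript handler or from a captured browser request.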

how to log into vBulletin 3.6 using mechanize (ruby)


The HTML looks like the markup below, or you can find it at http://www.vbulletin.org/forum/index.php: <form action="login.php?do=login" method="post" onsubmit="md5hash(vb_login_password, vb_login_md5password, vb_login_md5password_utf, 0)"> <script type="text/javascript" src="clientscript/vbulletin_md5.js?v=3612"></script> <table cellpadding="0" cellspacing="1" border="0"> <tr> <td class="smallfont" align="left"><label for="navbar_username">User Name</label></td> <td class="smallfont" align="left" colspan="2"><label for="navbar_password">Password</label></td> </tr> <tr> <td><input type

Downloading pdf files using mechanize and urllib


Question: I am new to Python, and my current task is to write a web crawler that looks for PDF files in certain webpages and downloads them. Here's my current approach (just for 1 sample url):

    import mechanize
    import urllib
    import sys

    mech = mechanize.Browser()
    mech.set_handle_robots(False)
    url = "http://www.xyz.com"
    try:
        mech.open(url, timeout = 30.0)
    except HTTPError, e:
        sys.exit("%d: %s" % (e.code, e.msg))
    links = mech.links()
    for l in links:
        #Some are relative links
        path = str(l.base_url[:-1])+str
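The question's code mentions that some links are relative. One way to handle that, sketched here under the assumption that the link targets have already been collected as plain href strings, is to resolve each one against the page URL with `urllib.parse.urljoin` instead of concatenating strings. `pdf_urls` is a hypothetical helper name, and the snippet is written for Python 3 although the question's code is Python 2.

```python
from urllib.parse import urljoin, urlparse


def pdf_urls(page_url, hrefs):
    """Resolve hrefs against page_url and keep those pointing at PDFs."""
    found = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        absolute = urljoin(page_url, href)
        # Check only the path so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            found.append(absolute)
    return found
```

For example, `pdf_urls("http://www.xyz.com/docs/index.html", ["a.pdf", "page.html"])` keeps only the first link and returns it in absolute form. Matching on the extension is a heuristic; a server may serve PDFs from URLs without a `.pdf` suffix.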

Python Mechanize Browser: HTTP Error 460


Question: I am trying to log into a site using a mechanize browser and am getting an HTTP 460 error, which appears to be a made-up error, so I'm not sure what to make of it. Here's the code:

    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [(

Verifying br.submit() using Python's Mechanize module


Question: I'm just trying to log in to a website using mechanize. When I print "br.form", I can see my credentials entered into the form, but I do not know how to actually submit the form properly. I call "br.submit()" and try to verify that it has proceeded to the next page by printing br.title(), but the title that appears is the one for the login screen, not the post-login screen.

    import mechanize
    from time import sleep
    def reportDownload():
        # Prompt for login credentials
        print("We require your credentials.")