screen-scraping

XPath selects in HTMLAgilityPack don't work as expected

限于喜欢 submitted on 2019-12-05 09:22:38
I'm writing a simple screen-scraping program in C#, for which I need to select all the inputs placed inside one single form named "aspnetForm" (there are two forms on the page, and I don't want the inputs from the other one); the inputs in this form sit inside various tables and divs, or directly at the first-child level of the form. So I wrote a really simple XPath query: //form[@id='aspnetForm']//input. It works as expected in every browser I tested (Chrome, IE, Firefox): it returns what I want. But in HTMLAgilityPack it doesn't work at all; SelectNodes always returns null. The queries I've…
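A likely cause, for anyone hitting the same wall: HtmlAgilityPack by default parses <form> as an empty element (because forms may legally overlap in real-world HTML), so the form's inputs come out as siblings rather than descendants, and //form[...]//input matches nothing. A minimal C# sketch of the commonly cited workaround, assuming the page has been saved locally:

    using HtmlAgilityPack;

    class Scraper
    {
        static void Main()
        {
            // By default <form> is treated as an empty element, so its inputs are
            // parsed as siblings, not descendants. Undo that before loading.
            HtmlNode.ElementsFlags.Remove("form");

            var doc = new HtmlDocument();
            doc.Load("page.html"); // placeholder: a saved copy of the target page

            var inputs = doc.DocumentNode.SelectNodes("//form[@id='aspnetForm']//input");
            if (inputs != null)   // SelectNodes returns null when nothing matches
                foreach (var input in inputs)
                    System.Console.WriteLine(input.GetAttributeValue("name", ""));
        }
    }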

How to extract links from a webpage using lxml, XPath and Python?

回眸只為那壹抹淺笑 submitted on 2019-12-05 09:14:25
I've got this XPath query: /html/body//tbody/tr[*]/td[*]/a[@title]/@href. It extracts all the links that have a title attribute, and gives the href, in Firefox's XPath Checker add-on. However, I cannot seem to use it with lxml:

    from lxml import etree
    parsedPage = etree.HTML(page)  # Create parse tree from valid page.
    # XPath query
    hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
    for x in hyperlinks:
        print x  # Print links in <a> tags, containing the title attribute

This produces no result from lxml (an empty list). How would one grab the href text (link) of a hyperlink…
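A note on why a browser-verified path can fail here: Firefox inserts a <tbody> element into every rendered table even when the raw HTML contains none, so a path copied out of the browser can match nothing against the source that lxml actually parses. A sketch of the usual fix, dropping tbody and anchoring on the <a> directly (Python 3 syntax; `page` stands in for the HTML fetched earlier):

    from lxml import etree

    page = "<html><body><table><tr><td><a title='t' href='http://example.com/x'>x</a></td></tr></table></body></html>"
    parsedPage = etree.HTML(page)

    # //tr reaches the rows whether or not a <tbody> exists in the source.
    hyperlinks = parsedPage.xpath("//tr/td/a[@title]/@href")
    for href in hyperlinks:
        print(href)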

How to log in through PHP cURL when the form is submitted by JavaScript, i.e. there is no submit button in the form

隐身守侯 submitted on 2019-12-05 07:51:56
I am trying to log in to a secure https website through cURL. My code runs successfully for other sites, but on some websites where the form is submitted through JavaScript it doesn't work. Currently I am using the following code for cURL:

    <?
    # Define target page
    $target = "https://www.domainname.com/login.jsf";
    # Define the login form data
    $form_data = "enter=Enter&username=webbot&password=sp1der3";
    # Create the cURL session
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);
    // Define…
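Whatever JavaScript does on the client, the login still reaches the server as an ordinary HTTP request, so the usual approach is to watch the real request in the browser's network tab and replicate it. A hedged PHP sketch; the field names and the ViewState handling are assumptions based on this being a JSF page (login.jsf), not the poster's confirmed form:

    <?php
    $target = "https://www.domainname.com/login.jsf";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");  // keep the session cookie
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");

    // 1) GET the login page first, to pick up cookies and hidden fields.
    $loginPage = curl_exec($ch);

    // 2) JSF forms carry a hidden javax.faces.ViewState value that must be echoed back.
    preg_match('/name="javax\.faces\.ViewState"[^>]*value="([^"]*)"/', $loginPage, $m);
    $viewState = isset($m[1]) ? $m[1] : '';

    // 3) POST the same fields the JavaScript submit handler would send.
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
        'username' => 'webbot',
        'password' => 'sp1der3',
        'enter' => 'Enter',
        'javax.faces.ViewState' => $viewState,
    )));
    $response = curl_exec($ch);
    curl_close($ch);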

Sending form data to an aspx page

你离开我真会死。 submitted on 2019-12-05 07:46:56
There is a need to do a search on the website url = r'http://www.cpso.on.ca/docsearch/'. This is an aspx page (I began this trek only yesterday, so sorry for the noob questions). Using BeautifulSoup, I can get the __VIEWSTATE and __EVENTVALIDATION like this:

    viewstate = soup.find('input', {'id': '__VIEWSTATE'})['value']
    eventval = soup.find('input', {'id': '__EVENTVALIDATION'})['value']

and the header can be set like this:

    headers = {'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13', 'HTTP_ACCEPT': 'text/html,application/xhtml+xml…
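For completeness, a sketch of how the pieces usually fit together with requests plus BeautifulSoup: fetch the page, echo the hidden ASP.NET fields back, and POST the search. The search control's name below is a placeholder; the real one has to be read off the form's <input> elements:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.cpso.on.ca/docsearch/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; '
                             'rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13'}

    session = requests.Session()
    soup = BeautifulSoup(session.get(url, headers=headers).text, 'html.parser')

    # ASP.NET rejects the POST unless these hidden fields are sent back.
    form_data = {
        '__VIEWSTATE': soup.find('input', {'id': '__VIEWSTATE'})['value'],
        '__EVENTVALIDATION': soup.find('input', {'id': '__EVENTVALIDATION'})['value'],
        'ctl00$SearchControl': 'smith',   # placeholder control name and query
    }
    result = session.post(url, data=form_data, headers=headers)
    print(result.status_code)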

How to perform a feasible web smoke test with Selenium WebDriver?

自作多情 submitted on 2019-12-05 06:49:07
I have been doing some research on a feasible and faster web-page-loading test with Selenium. The general idea of a smoke test is to click and navigate through the whole site to make sure the pages load properly. I initially thought of capturing the HTTP status code through some HTTP library, since Selenium has no native support for that. But I found that is not what I want, since it would simply return each and every link of the site, and most of…
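One common compromise, sketched below with the Selenium 4 Python bindings: let WebDriver collect the links, but check each one with a cheap HEAD request from an HTTP library, since WebDriver never exposes status codes. The start URL is a placeholder:

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/")   # placeholder start page

    # Gather every distinct href on the page.
    links = {a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")}

    # HEAD each link instead of loading it in the browser: much faster,
    # and it surfaces the status code WebDriver cannot give us.
    for link in sorted(links):
        status = requests.head(link, allow_redirects=True).status_code
        if status >= 400:
            print("broken:", status, link)

    driver.quit()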

Can RapidMiner extract XPaths from a list of URLs, instead of first saving the HTML pages?

落花浮王杯 submitted on 2019-12-05 06:02:31
I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program (it has more options than the 'Crawl Web' operator in RapidMiner). I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have…
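As a point of comparison (not RapidMiner itself): the same task, fetching a URL list and pulling XPath matches straight from the responses without saving pages to disk, is only a few lines in Python with lxml. The URLs and the XPath below are placeholders:

    import requests
    from lxml import html

    urls = ["http://example.com/page1", "http://example.com/page2"]

    for url in urls:
        tree = html.fromstring(requests.get(url).content)
        # Matches are extracted in memory; nothing is written to disk.
        for match in tree.xpath("//h1/text()"):
            print(url, match.strip())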

What's the fastest way to scrape a lot of pages in PHP?

放肆的年华 submitted on 2019-12-05 05:16:21
I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable by the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple cURL requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs 24/7, basically). Running a multi-cURL request in a simple while loop is fairly slow. I sped it up by doing individual cURL requests in a background process…
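The standard step beyond one blocking curl_exec per URL is curl_multi, which drives a batch of transfers in parallel on a single thread. A sketch with placeholder URLs; a 24/7 scraper would add a rolling window and handle re-use on top of this:

    <?php
    $urls = array("http://example.com/a", "http://example.com/b", "http://example.com/c");

    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Pump all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);  // sleep until there is socket activity
        }
    } while ($running && $status == CURLM_OK);

    foreach ($handles as $url => $ch) {
        $body = curl_multi_getcontent($ch);
        echo $url, " -> ", strlen($body), " bytes\n";
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);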

How to read someone else's forum

大城市里の小女人 submitted on 2019-12-05 03:40:07
My friend has a forum which is full of posts containing information. Sometimes she wants to review the posts in her forum and come to conclusions. At the moment she reviews posts by clicking through her forum, and builds a not-necessarily-accurate picture of the data (in her brain) from which she draws conclusions. My thought today was that I could probably bang out a quick Ruby script that would parse the necessary HTML to give her a real idea of what the data is saying. I am using Ruby…
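The quick-script idea usually lands on the Nokogiri gem in Ruby. A sketch that tallies posts per author for one topic page; the URL and the CSS selectors are placeholders that must be read off the real forum's markup:

    require "open-uri"
    require "nokogiri"

    doc = Nokogiri::HTML(URI.open("http://example.com/forum/topic/1"))

    # Count posts per author; the selectors are guesses at typical forum markup.
    counts = Hash.new(0)
    doc.css("div.post").each do |post|
      author = post.at_css("span.author")
      counts[author.text.strip] += 1 if author
    end

    counts.sort_by { |_, n| -n }.each { |author, n| puts "#{author}: #{n}" }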

Where is the memory leak? How to time out threads during multiprocessing in Python?

隐身守侯 submitted on 2019-12-05 02:24:01
It is unclear how to properly time out workers of joblib's Parallel in Python. Others have had similar questions here, here, here and here. In my example I am utilizing a pool of 50 joblib workers with the threading backend.

Parallel call (threading):

    output = Parallel(n_jobs=50, backend='threading')(delayed(get_output)(INPUT) for INPUT in list)

Here, Parallel hangs without errors as soon as len(list) <= n_jobs, but only when n_jobs => -1. To circumvent this issue, people give instructions on how to create a timeout decorator for the function passed to Parallel (get_output(INPUT)) in the…
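The decorator idea referred to above can be sketched with nothing but the standard library: run the call in a throwaway single-worker pool and stop waiting after a deadline. This is a general pattern, not joblib's own API, and get_output is the question's placeholder name; note the caveat in the comments:

    import functools
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def with_timeout(seconds):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                pool = ThreadPoolExecutor(max_workers=1)
                future = pool.submit(func, *args, **kwargs)
                try:
                    return future.result(timeout=seconds)
                except TimeoutError:
                    # The thread itself cannot be killed; we only stop waiting
                    # for it. Leaked, still-running work like this is one way a
                    # "timeout" scheme turns into an apparent memory leak.
                    return None
                finally:
                    pool.shutdown(wait=False)  # don't block on a stuck thread
            return wrapper
        return decorator

    @with_timeout(10)
    def get_output(x):
        return x * 2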

Watir: Changing Mozilla Firefox Preferences

纵饮孤独 submitted on 2019-12-05 01:23:09
I'm running a Ruby script using Watir to automate some things for me. I'm attempting to automatically save some files to a certain directory. So, in my Mozilla settings I set my default download directory (say, to the desktop) and choose to save files automatically. These changes, however, are not reflected when I run my script; it seems the preferences revert to their defaults. I've included the following:

    require "rubygems"        # Optional.
    require "watir-webdriver" # For web automation.
    …
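The usual explanation: WebDriver launches Firefox with a fresh, anonymous profile on every run, so preferences changed by hand in the desktop browser never carry over; they have to be set on the profile handed to the browser. A sketch against the watir-webdriver-era API; the download directory and MIME type are placeholders:

    require "rubygems"        # Optional.
    require "watir-webdriver"

    profile = Selenium::WebDriver::Firefox::Profile.new
    profile['browser.download.folderList'] = 2              # 2 = use a custom directory
    profile['browser.download.dir'] = "/tmp/downloads"      # placeholder path
    profile['browser.helperApps.neverAsk.saveToDisk'] = "application/pdf" # placeholder MIME type

    browser = Watir::Browser.new :firefox, :profile => profile
    browser.goto "http://example.com/report.pdf"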