htmlunit | 易学教程

How can I add cookies to HtmlUnit request header?

I'm trying to access a site, and I'm having trouble adding the collected cookies to the "Cookie" header of an outgoing POST request. I've been able to verify that they are present in the CookieManager. Alternatives to HtmlUnit would also be appreciated.
    public static void main( String[] args ) {
        // Turn off logging to prevent polluting the output.
        Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
        try {
            final WebClient webClient = new WebClient(BrowserVersion.CHROME);
            webClient.getOptions().setCssEnabled(false);
            CookieManager cookieManager = webClient.getCookieManager();
            out.println
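For orientation, the value of a `Cookie` request header is the stored name/value pairs joined with `; `. The sketch below builds that string with the JDK only. The class name `CookieHeaderSketch`, the helper `toCookieHeader`, and the sample cookie values are hypothetical and are not part of HtmlUnit, which should normally send cookies held by its `CookieManager` on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not an HtmlUnit class.
final class CookieHeaderSketch {
    // Joins name=value pairs with "; ", the layout of a Cookie request header.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>(); // keeps insertion order
        cookies.put("JSESSIONID", "abc123"); // made-up sample values
        cookies.put("lang", "en");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

If the header really has to be set by hand, one option to try is attaching the resulting string to the request with `WebRequest.setAdditionalHeader("Cookie", value)`; whether that is needed depends on why the stored cookies are not being sent in the first place.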

Passing basic auth credentials with every request with HtmlUnit WebClient

I'm trying to write a simple smoke test for a web application. The application normally uses form-based authentication but also accepts basic auth. Because form-based authentication is the default, it never sends an "authentication required" response and just sends the login form instead. In the test I try to send the basic auth header using
    WebClient webClient = new WebClient();
    DefaultCredentialsProvider creds = new DefaultCredentialsProvider();
    // Set some example credentials
    creds.addCredentials("usr", "pwd");
    // And now add the provider to the webClient instance
    webClient
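Since the server in this scenario never issues a challenge, one approach is to send the `Authorization` header preemptively. A minimal sketch of how that header value is formed, using only the JDK, follows. The class name `BasicAuthHeaderSketch` and the helper `basicAuthValue` are hypothetical; the credentials are the example ones from the excerpt.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not an HtmlUnit class.
final class BasicAuthHeaderSketch {
    // HTTP basic auth value: "Basic " + base64(user + ":" + password).
    static String basicAuthValue(String user, String password) {
        String pair = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Same example credentials as in the excerpt.
        // Prints: Authorization: Basic dXNyOnB3ZA==
        System.out.println("Authorization: " + basicAuthValue("usr", "pwd"));
    }
}
```

With HtmlUnit, the value could then be passed to `webClient.addRequestHeader("Authorization", value)` so that it accompanies every request. This assumes the HtmlUnit version in use provides that method, and it has not been tested against the application described here.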

How to detect when Selenium loads a browser's error page

Question: Is there a universal way to detect when a Selenium browser opens an error page? For example, disable your internet connection and call driver.get("http://google.com"). In Firefox, Selenium will load the 'Try Again' error page containing text like "Firefox can't establish a connection to the server at www.google.com." Selenium will NOT throw any errors. Is there a browser-independent way to detect these cases? For Firefox (Python), I can do
    if "errorPageContainer" in [ elem.get_attribute("id")

Notes: HtmlUnit for web pages

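HtmlUnit is an open-source Java tool for analyzing web pages, and it can be used effectively to analyze a page's content. The project simulates a running browser and has been described as an open-source browser implementation for Java: a browser without a user interface.
Using the API
Simulating a specific browser, for example:
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3); // simulate Firefox
Finding specific elements
Fetch the page with the get method:
    HtmlPage page = webClient.getPage("URL"); // fetch the page source
    HtmlDivision div = (HtmlDivision) page.getElementById("hed"); // get the element whose id attribute is "hed"
Fetching via XPath:
    HtmlDivision div = (HtmlDivision) page.getByXPath("//div").get(0);
    System.out.println(div.asXml()); // print the markup
Configuring a proxy server
Proxy configuration is simple: you only need to set the address, port, username, and password. For example:
    // create the object
    WebClient webClient = new WebClient(BrowserVersion.CHROME, "127.0.0.1", 8087); // browser to simulate, proxy IP address, port number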

Problem in HtmlUnit API for Java (Headless Browser)?

Question: I am using the HtmlUnit headless browser to browse this webpage (you can see the webpage to have a better understanding of the problem). I have set the select's value to "1" with the following commands:
    final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_7);
    try {
        // Configuring the webClient
        webClient.setJavaScriptEnabled(true);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setCssEnabled(true);
        webClient.setUseInsecureSSL(true);
        webClient.setRedirectEnabled(true)

How to use HtmlUnit in Java?

I'm trying to use HtmlUnit in Java to log into a website. First I enter the user name, then the password. After that I need to select an option from a dropdown box. Entering the user name and password seems to have worked, but when I try to select the item from the dropdown box I get errors. Can anyone help me fix this? My code is as follows:
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlOption;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlSelect;

How to create HtmlUnit HTMLPage object from String?

This question has been asked before, but I guess the API has changed and the answers are no longer valid.
    URL url = new URL("http://www.example.com");
    StringWebResponse response = new StringWebResponse("<html><head><title>Test</title></head><body></body></html>", url);
    HtmlPage page = HTMLParser.parseHtml(response, new TopLevelWindow("top", new WebClient()));
    System.out.println(page.getTitleText());
This can't be done because TopLevelWindow is protected, and extending or implementing the window just to get around that seems ridiculous :) Does anybody have an idea how to do that? It seems weird to me that it

Scrapy in practice: targeted scraping of product data from an online shop

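A web crawler, also called a web spider, is a computer program that fetches and downloads pages from the internet according to some logic and algorithm; it is an important component of a search engine. A typical crawler starts from a set of start URLs and crawls according to some strategy. Newly discovered URLs are added to the crawl queue and a new round of crawling begins, until everything has been fetched. Let's look at the problems a crawler typically runs into: the number of pages to crawl is very large; pages also change a lot, and on typical sites such as news and e-commerce sites they are updated more or less in real time; and most pages are dynamic, multimedia, or closed (facebook). The sheer number of pages means that only part of them can be crawled in a given time, so crawl priorities have to be defined clearly. Frequent updates mean the crawler has to fetch the newest pages and keep links valid, so list pages that are likely to lead to new pages are especially important. On news sites, new pages generally appear on the home page or on designated category pages, but on Taobao product updates are hard to predict. What about dynamic pages? Most pages now use JS and AJAX, so crawling is no longer a matter of simply running wget; modern page structure requires a smarter crawler that handles the various cases more flexibly. For a general-purpose crawler we therefore need to define a crawl strategy: which pages need to be downloaded, which do not, and which are downloaded first. Once that is clear, a lot of pointless crawling is avoided. Update strategy: monitor list pages to discover new pages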
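The queue-driven loop described above (start URLs, a crawl queue, newly discovered URLs appended, each page fetched once) can be sketched as follows. This is an illustration in plain Java, not Scrapy's API. The class `CrawlFrontierSketch`, the method `crawl`, and the in-memory link map that stands in for real HTTP fetches are all hypothetical, and the strategy shown is simple breadth-first order with a page limit rather than the priority scheme the article goes on to discuss.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first crawl over an in-memory link graph.
final class CrawlFrontierSketch {
    // Returns URLs in the order they would be fetched; each URL is fetched at most once.
    static List<String> crawl(List<String> startUrls,
                              Map<String, List<String>> links,
                              int maxPages) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String start : startUrls) {
            if (seen.add(start)) {
                queue.add(start);
            }
        }
        List<String> fetched = new ArrayList<>();
        while (!queue.isEmpty() && fetched.size() < maxPages) {
            String url = queue.poll();
            fetched.add(url); // a real crawler would download and parse the page here
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) { // true only the first time a URL is discovered
                    queue.add(next);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Made-up site: a list page linking to item pages.
        Map<String, List<String>> links = Map.of(
                "/list", List.of("/item/1", "/item/2"),
                "/item/1", List.of("/list", "/item/3"));
        System.out.println(crawl(List.of("/list"), links, 10));
    }
}
```

Replacing the `ArrayDeque` with a `PriorityQueue` keyed on a score per URL would be one way to express the crawl priorities mentioned in the text; the `maxPages` limit reflects the point that only part of the web can be fetched in a given time.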

HtmlUnit + Selenium within Production

Question: I am currently using HtmlUnit, driven by Selenium (WebDriver), within my production code. I am scraping and interacting with various websites programmatically with these libraries. I am having some success and am not experiencing memory issues (I ensure sessions are always cleaned up). I am wondering whether these libraries are okay for a production environment or are recommended against. This is difficult to find via Google due to the enormous amount of information about automated testing rather

Can't turn off HtmlUnit logging messages

I'm using HtmlUnit to interact with a web page that interacts with the server via Ajax. Soon after the Ajax code starts, HtmlUnit produces these two log messages:
    WARNING: Ignoring XMLHttpRequest.setRequestHeader for Content-length: it is a restricted header
    Mar 3, 2011 3:32:47 PM com.gargoylesoftware.htmlunit.javascript.host.xml.XMLHttpRequest jsxFunction_setRequestHeader
    WARNING: Ignoring XMLHttpRequest.setRequestHeader for Connection: it is a restricted header
    Mar 3, 2011 3:32:47 PM com.gargoylesoftware.htmlunit.javascript.host.xml.XMLHttpRequest jsxGet_status
...Followed by this message,
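The timestamp, class, and method layout of these messages resembles the default formatter of `java.util.logging`, so one thing worth trying is to turn the `com.gargoylesoftware.htmlunit` logger off and keep a strong reference to it. The sketch below uses only the JDK; the class `QuietHtmlUnitLogging` and its `silence` method are hypothetical, and it assumes the messages really do go through `java.util.logging` rather than another logging backend.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not an HtmlUnit class.
final class QuietHtmlUnitLogging {
    // java.util.logging holds loggers weakly, so a level set on a logger that
    // nothing else references can be lost once the logger is garbage collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Turns off the HtmlUnit logger; child loggers without their own level inherit this.
    static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = silence();
        System.out.println(logger.getName() + " level: " + logger.getLevel());
    }
}
```

Calling `QuietHtmlUnitLogging.silence()` before creating the `WebClient` would be the intended usage. If the messages still appear, the logging is probably being routed through a different framework, in which case that framework's own configuration applies.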
