htmlunit

Can I configure HTMLUnit to only run specific javascript processes and not the whole thing?

一曲冷凌霜 submitted on 2020-01-23 07:05:26
Question: I'm looking to gather information from a set of web pages that are all very similarly formatted. I need some information that is loaded onto the page by JavaScript after opening. HtmlUnit seems to be a common tool for this, so that's what I'm using. Unfortunately it is very slow, a complaint I've seen across many forums. The webClient.getPage() call is what takes forever. When I turn off JavaScript it runs quickly, but I need to execute some JavaScript

Implementation notes for the Lanzou Cloud (蓝奏云) batch download tool

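狂风中的少年 submitted on 2020-01-19 22:13:58
This article is a supplementary note to my tool, the Lanzou Cloud (蓝奏云) batch download tool. I plan to walk through, step by step, the ideas and methods behind its implementation.

Topics involved: Java IO streams; downloading files in Java; using HtmlUnit; using okhttp.

Analysis and design: One day I found an e-book resource, but the Lanzou Cloud link pointed to a folder. Lanzou Cloud does not support batch downloads, so I got the idea of building a batch download tool. It took about five days to get working, and the redirect-based download alone cost me two of them. It was painful...

Let's go through it in order. First, I opened the Lanzou Cloud link in a browser. There are two cases: the share either requires an extraction code or it does not. At this point we need to simulate the user submitting the form automatically (see key code part 1). After searching online, I found that the open-source library HtmlUnit can do what we need. HtmlUnit is, simply put, a browser: a headless browser written in Java. Because it has no UI, it runs reasonably fast.

After that we arrive at a page like this. We need to parse the current page and get the Lanzou Cloud link for each file. Since HtmlUnit has built-in HTML element selectors, we can use them to filter nodes and obtain the file information and the corresponding Lanzou Cloud links (see key code part 2). Also, when there are too many files a "Show more" button appears. We have to handle that case too, and we can use HtmlUnit to click the "Show more" button automatically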
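The post lists Java IO streams and file downloading among the topics involved, but the tool's own download code is not part of this excerpt. As a rough illustration only, the sketch below saves an arbitrary InputStream to a file using nothing but the JDK. The class name StreamSaver and its save method are hypothetical, and the in-memory stream merely stands in for whatever response body the real tool obtains.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not taken from the tool described above.
final class StreamSaver {

    // Copies every byte of the stream into the target file, overwriting it
    // if it already exists, and returns the number of bytes written.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) { // closes the stream when the copy ends
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for a downloaded response body.
        byte[] payload = "example file body".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("download-demo", ".bin");
        long written = save(new ByteArrayInputStream(payload), target);
        System.out.println(written + " bytes written to " + target);
        Files.delete(target);
    }
}
```

In a real download the InputStream would come from the HTTP response, and the target path would be derived from the file name parsed out of the page.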

Javascript onload event not firing in HtmlUnit

别来无恙 submitted on 2020-01-17 07:23:27
Question: I need to load a page from an HTML string, not from a server. I use the method found in this answer: String html = "<html><head><script>" + "alert('hey 1'); " + "function OnLoadEvent() { alert('hey 2'); }" + "</script></head>" + "<body onload='OnLoadEvent()'></body></html>"; URL url = new URL("http://www.example.com"); StringWebResponse response = new StringWebResponse(html, url); WebClient client = new WebClient(); client.getOptions().setJavaScriptEnabled(true); HtmlPage page = HTMLParser

HtmlUnit ValidatorException: PKIX path building failed:

自古美人都是妖i submitted on 2020-01-16 14:31:33
[09:17:36:713] [ERROR] - com.xx.sea.util.HtmlUnitUtil.httpGetResponse(HtmlUnitUtil.java:95) - htmlunit err
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) ~[?:1.8.0_91]
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949) ~[?:1.8.0_91]
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302) ~[?:1.8.0_91]
    at sun.security.ssl.Handshaker.fatalSE

What Exception is thrown on timeout?

萝らか妹 submitted on 2020-01-14 03:56:10
Question: What exception is thrown on a connection timeout in HtmlUnit? Answer 1: HtmlUnit uses the Apache HttpClient. The timeout mechanism throws an InterruptedIOException; see the HttpClient documentation. This exception is a subclass of IOException, which can be thrown during any HttpClient execute call (basically whenever you get a page with an HtmlUnit WebClient). Answer 2: I think there is a bug: it really should throw an exception, but it does not if you set a timeout greater than a certain value. You can see it in
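The exception hierarchy described in the first answer can be sketched with JDK classes alone. This is a minimal illustration, assuming the timeout surfaces as an InterruptedIOException or one of its subclasses (java.net.SocketTimeoutException is used as the stand-in here). TimeoutClassifier, IoCall and classify are hypothetical names, not part of HtmlUnit or HttpClient, and the lambda stands in for a page-loading call.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration; not part of HtmlUnit or HttpClient.
final class TimeoutClassifier {

    // Stand-in for any call that may throw IOException, such as loading a page.
    interface IoCall {
        void run() throws IOException;
    }

    // Returns "timeout" for InterruptedIOException (or a subclass),
    // "io-error" for any other IOException, and "ok" if nothing is thrown.
    static String classify(IoCall call) {
        try {
            call.run();
            return "ok";
        } catch (InterruptedIOException e) {
            // More specific type, so this clause must precede IOException.
            return "timeout";
        } catch (IOException e) {
            return "io-error";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> { throw new SocketTimeoutException("Read timed out"); }));
        System.out.println(classify(() -> { throw new IOException("connection reset"); }));
        System.out.println(classify(() -> { }));
    }
}
```

Because InterruptedIOException extends IOException, code that only catches IOException will also swallow timeouts; catching the subclass first is what lets the two cases be told apart.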

HtmlUnit ScriptException errors

时光毁灭记忆、已成空白 submitted on 2020-01-13 05:52:29
Question: I am using HtmlUnitDriver, and here is my code: HtmlUnitDriver driver = new HtmlUnitDriver(true); driver.get("some url here"); I am getting the following exception: Caused by: com.gargoylesoftware.htmlunit.ScriptException: Wrapped com.gargoylesoftware.htmlunit.ScriptException: SyntaxError: missing ; before statement (http://sales.liveperson.net/hcp/html/mTag.js?site=7824460#1(eval)#1) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:595) at

Running HtmlUnit with Jython - issue with startup on command line

南楼画角 submitted on 2020-01-13 04:48:07
Question: I tried to run HtmlUnit with Jython following this tutorial: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/ but it does not work for me. I am unable to import the com.gargoylesoftware packages; there are only some HTML files in the HtmlUnit folder. Do I need to import those somehow? The tutorial says to run the Python script like this: /opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py and I try to run: java -jar /Users/adam/jython/jython.jar -J-classpath "htmlunit-2.8

[HttpClient] HttpClient summary, part 1: basic usage

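◇◆丶佛笑我妖孽 submitted on 2020-01-07 14:13:29
Recently at work I built a file browser for Hadoop's HDFS. It interacts with HDFS through the REST API provided by WebHDFS and makes heavy use of HttpClient. I had been too busy to write any of this up, but today I have some free time, so here is a proper summary.

1. Introduction to HttpClient: HTTP is arguably one of the most important and most widely used protocols on the Internet today. More and more Java applications need to access network resources over HTTP, especially now that REST APIs are popular. HttpClient is a subproject of Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol, and it supports the latest versions and recommendations of HTTP. HttpClient is already used in many projects: well-known open-source projects such as Cactus (an Apache Jakarta project) and HtmlUnit both use it, and many Java crawlers are built on it as well. Studying HttpClient is therefore well worth our time.

2. HttpClient is not a browser: Because HttpClient is an HTTP client programming toolkit, many people think of it as a browser, but in fact HttpClient is not a browser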

How to locate element with tag alert inside outer div

落爺英雄遲暮 submitted on 2020-01-07 03:15:40
Question: I am trying to locate elements in this page and put them in objects (DomElement) so I can run some tests on them. The problem is with the element reg-error-yid, which is an inner div inside the div yid-field-suggestion. I tried getElementById, byName, byXPath, and getFirstByXPath, and none of them work. I also tried switching from WebClient to WebDriver and using driver.findElement(By.className("oneid-error-message")), which does not work either. The element with the registration message: <div id="reg-error-yid" class="oneid-error

Can I use HtmlUnit to listen for resource loading events?

时间秒杀一切 submitted on 2020-01-05 03:00:08
Question: I'm trying to use HtmlUnit to detect resources (scripts, images, stylesheets, etc.) that fail to load on a webpage. I've tried new WebConnectionWrapper(webClient) { @Override public WebResponse getResponse(WebRequest request) throws IOException { WebResponse response; response = super.getResponse(request); System.out.println(response.getStatusCode()); return response; } }; to no avail. It doesn't seem to handle CSS, images or JS, despite HtmlUnit logging: statusCode=[404] contentType=[text