htmlunit

HtmlUnit Only Displays Host HTML Page for GWT App

心不动则不痛 posted on 2019-12-17 09:52:24
Question: I am using the HtmlUnit API to add crawler support to my GWT app as follows:

    PrintWriter out = null;
    try {
        resp.setCharacterEncoding(CHAR_ENCODING);
        resp.setContentType("text/html");
        url = buildUrl(req);
        out = resp.getWriter();
        WebClient webClient = webClientProvider.get();
        // set options
        WebClientOptions options = webClient.getOptions();
        options.setCssEnabled(false);
        options.setThrowExceptionOnScriptError(false);
        options.setThrowExceptionOnFailingStatusCode(false);
        options.setRedirectEnabled

Getting Jsoup to support dynamically generated html by JavaScript

岁酱吖の posted on 2019-12-17 06:51:50
Question: Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point, there is no problem. The crawler works and you can customize it really quickly via a cfg file. I use Jsoup to parse the HTML content. I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Isn't there a way to make Jsoup support JavaScript? Or at least get the full HTML content I can see in my

Option “setThrowExceptionOnScriptError(false)” does NOT work in HtmlUnit! Why? (Java)

筅森魡賤 posted on 2019-12-16 18:04:45
Question: My problem is the one in the title. (I use JDK + NetBeans.) I downloaded HtmlUnit from http://sourceforge.net/projects/htmlunit/files/htmlunit/, trying every version between 2.9 and 2.14, and none of them works with this function. For example, my code (Java):

    .....
    import com.gargoylesoftware.htmlunit.AlertHandler;
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.ScriptPreProcessor;
    import com.gargoylesoftware.htmlunit.ScriptResult;

HtmlUnit Web Crawler: A Beginner's Study Notes (Part 2)

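生来就可爱ヽ(ⅴ<●) posted on 2019-12-14 18:23:59
This time I use crawling Sina Weibo as the example. The process was a real struggle. I consulted many posts by experts, but quite a few problems remain, so let's go through them step by step. I hope experts can point out my mistakes; my approach is still rather crude for now.

Login problem

Crawling Sina Weibo requires logging in first. The picture site I crawled before needed no login, so this step did not exist there, but for Sina Weibo we must log in, and that raises one issue: the CAPTCHA. From what I have found on Baidu and from my own understanding, it cannot be solved for now short of entering it by hand, because a CAPTCHA exists precisely to block malicious logins and crawlers. So I suggest that anyone who wants to try this use an account that is not currently prompted for a CAPTCHA. (If any expert has hints about CAPTCHAs, I would appreciate them.)

Here is the demo code:

    WebClient webClient = new WebClient();
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setCssEnabled(false);
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage htmlPage = null;
    try {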

A Powerful Tool for Simulating a Browser

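百般思念 posted on 2019-12-14 17:43:13
As the Web has evolved, RIAs have become more and more common, and JavaScript and complex AJAX libraries pose a major challenge for web crawlers: to get the text you need when parsing a page, you have to simulate a browser executing the JavaScript. Fortunately, there is an open-source Java project, HtmlUnit, that can simulate browsers such as Firefox, IE, and Chrome. It can be used not only to test web applications but also to parse pages that contain JS in order to extract information. Let's see how well HtmlUnit works.

First, create a Maven project and add the junit and HtmlUnit dependencies:

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.8.2</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.14</version>
    </dependency>

Next, write a JUnit unit test that uses HtmlUnit to extract page information:

    /** *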

Building a Web Crawler by Simulating Ajax: HtmlUnit

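淺唱寂寞╮ posted on 2019-12-14 17:34:19
Recently I have been using Jsoup to scrape data from a website, but some of its pages are generated dynamically by Ajax requests. I asked in a chat group, and an expert said to simply simulate the Ajax requests. I searched online, found this article, and am trying it out for now. The repost follows:

There are many ways to implement a web crawler, but many of them do not support Ajax. As Brother Li says, simulation is the way to go. Indeed, if you can simulate a browser without a user interface, what can't you do? There are quite a few frameworks for parsing Ajax sites; I chose HtmlUnit (official site: http://htmlunit.sourceforge.net/). HtmlUnit is essentially a headless browser written in Java. It can do almost anything, and much of it is very well encapsulated. This is the result of several days of hard work, recorded here.

    package com.lanyotech.www.wordbank;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.MalformedURLException;
    import java.util.List;
    import com.gargoylesoftware.htmlunit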

HtmlUnit form submit since button does not have a direct hyperlink

旧时模样 posted on 2019-12-13 20:31:05
Question: I have a button on a page, but the button has no hyperlink, so I need to submit the form to go to the next page. HtmlUnit does not wait until the next page loads, so the nextPage variable holds the current page instead of the next page (intermittently it works if the page loads quickly enough). How can I resolve this? HTML page:

    <form action="/webapp/NewPage.jsp" id="idForm01" accept-charset="UNKNOWN" onsubmit="return false;" name="frmNewPageForm" method="post" enctype="application/x

Terminate or Stop HtmlUnit

守給你的承諾、 posted on 2019-12-13 15:25:52
Question: I use HtmlUnit to test some websites, and I noticed that HtmlUnit gets stuck on some web pages. As a result, the thread from which HtmlUnit was called never terminates. Do you know of any way to stop HtmlUnit, as in a real web browser where you just click the browser's stop button? I want to stop/terminate HtmlUnit when it is stuck or hangs while accessing a web page. Thank you.

Answer 1: This should do it:

    webClient.closeAllWindows();

Source: https://stackoverflow.com/questions
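
As a complement to the answer above, here is a minimal sketch of a general Java pattern for giving up on a call that hangs: run it on a worker thread and wait with a timeout. It uses only the standard library. The class name `TimeBoxedFetch`, the method `runWithTimeout`, and the sleeping lambda are hypothetical stand-ins, and the sketch assumes that the blocked call reacts to thread interruption, which the question does not confirm for HtmlUnit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HtmlUnit.
public class TimeBoxedFetch {

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if it did not finish in time.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fetch-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page load that hangs; in real use the Callable
        // would wrap the webClient.getPage(...) call (assumption).
        String slow = runWithTimeout(() -> {
            Thread.sleep(5_000);
            return "loaded";
        }, 200);
        String fast = runWithTimeout(() -> "loaded", 200);
        System.out.println("slow = " + slow);
        System.out.println("fast = " + fast);
    }
}
```

When the helper returns null, the caller could then call `webClient.closeAllWindows()` as the answer suggests. The worker is a daemon thread so that a call that ignores the interrupt cannot keep the JVM from exiting.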

Selenium Error when using JavaScript or getting elements

你离开我真会死。 posted on 2019-12-13 05:52:59
Question: Using Selenium 2.25, I've run into a lot of issues. I'm trying to use the Selenium Remote Driver on a remote machine (server) from my computer (local / client). However, when I use DesiredCapabilities.htmlUnit(), it locates the elements but says they are not visible. I'm completely stumped by this. I'm not sure why an element can be found but is then not visible. So I tried to use some JavaScript to force it. It comes back and throws an error saying that the webpage can not

HtmlUnit commenting out lines of facebook page

流过昼夜 posted on 2019-12-13 05:13:14
Question: I am trying to simulate the login process to my Facebook page using HtmlUnit (and I do have good reasons for doing so). Here is my Java code:

    public static void main(String[] args) throws IOException {
        //tried to experiment with the browser types also. But to the same result
        //even using no param constructor does not help.
        WebClient webClient=new WebClient(BrowserVersion.CHROME);
        HtmlPage page1=webClient.getPage("https://www.facebook.com/bhramakarserver");
        HtmlForm loginForm=