htmlunit | 易学教程

Using HtmlUnit to Fetch Web Page Data Dynamically

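1. HtmlUnit is a headless browser written in Java. It models HTML documents and offers an API for loading pages, filling in forms, clicking links, and so on, operating just like a regular browser. It is typically used for testing and for scraping information from web pages. HtmlUnit combines the capabilities of HttpClient and jsoup, but it is comparatively slow; turning off its CSS and JavaScript processing (both enabled by default) makes it faster.
2. HtmlUnit is chosen here for scraping mainly in order to get the page's JS and CSS.
3. The main code is as follows: package com.los; import com.gargoylesoftware.htmlunit.WebClient; import com.gargoylesoftware.htmlunit.html.DomElement; import com.gargoylesoftware.htmlunit.html.DomNodeList; import com.gargoylesoftware.htmlunit.html.HtmlPage; import com.los.util.DownlandPic; import java.io.IOException; import java.util.regex.Pattern; public class HtmlUnitTest { public static void main(String[] args) throws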

HtmlUnit accessing an element without id or Name

Question: How can I access this element: <input type="submit" value="Save as XML" onclick="some code goes here"> More info: I have to programmatically access a web page and simulate clicking a button on it, which will then generate an XML file that I hope to save on the local machine. I am trying to do so with the HtmlUnit libraries, but all the examples I could find use the getElementById() or getElementByName() methods. Unfortunately, this exact element doesn't have a name or id, so I failed
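An element without an id or name can usually still be selected by its other attributes with an XPath expression. The sketch below is not HtmlUnit code: it uses only the JDK's XML and XPath classes on a small well-formed snippet to show the expression itself. The class name FindByAttribute and the sample markup are made up for illustration. In HtmlUnit, the same expression string can be passed to HtmlPage's getFirstByXPath(...) method, and the returned element can then be clicked.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class FindByAttribute {
    // Selects the button by its type and value attributes instead of id/name.
    static final String SAVE_BUTTON_XPATH =
            "//input[@type='submit' and @value='Save as XML']";

    // Parses well-formed markup and returns the first element matching
    // the XPath expression, or null when nothing matches.
    static Element findFirst(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return (Element) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page fragment containing the button from the question.
        String page = "<html><body><form>"
                + "<input type=\"text\" value=\"ignored\"/>"
                + "<input type=\"submit\" value=\"Save as XML\" onclick=\"some code goes here\"/>"
                + "</form></body></html>";
        Element button = findFirst(page, SAVE_BUTTON_XPATH);
        System.out.println(button.getAttribute("value"));
    }
}
```

Matching on the value attribute is brittle if the button label changes, so a more specific path (for example, scoped to the enclosing form) may be preferable on a real page.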

How to ignore HTMLUnit warnings/errors related to jQuery?

Is it possible to teach HtmlUnit to ignore certain JavaScript scripts/files on a web page? Some of them are simply out of my control (like jQuery) and I can't do anything about them. The warnings are annoying, for example: [WARN] com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument: getElementById(script1299254732492) did a getElementByName for Internet Explorer. I'm actually using JSFUnit, which runs HtmlUnit underneath. Answer (MrSmith42): If you want to avoid exceptions because of any JavaScript errors: webClient.setThrowExceptionOnScriptError(false); Well I am yet to find a way for that but I have
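For the [WARN] log lines themselves, one option is to raise the log level for HtmlUnit's package. The sketch below assumes that HtmlUnit's logging (Apache Commons Logging) ends up on java.util.logging, which is the usual fallback when no other backend such as Log4j is configured; with another backend the equivalent setting belongs in that backend's configuration. The class name QuietHtmlUnitLogging is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietHtmlUnitLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Turns off all log output below the HtmlUnit package
    // (assumes java.util.logging is the active backend).
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers without their own level inherit OFF from the package logger.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument");
        System.out.println(child.isLoggable(Level.WARNING));
    }
}
```

Level.OFF hides real problems as well; Level.SEVERE is a gentler choice if only warnings should disappear.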

Restricting Selenium/Webdriver/HtmlUnit to a certain domain

Question: While using Selenium/WebDriver for web scraping, I realized the target site has a Google Analytics script running. Is there a way to restrict Selenium/WebDriver/HtmlUnit so that it avoids certain URLs/domains? Thanks. Answer 1: I think it is impossible, because Selenium is actually an adapter for several implementations, so it cannot prevent Firefox or Chrome from loading particular scripts. Perhaps you can check the driver API (Firefox profile, HtmlUnit configuration file) to accomplish this. Source: https://stackoverflow.com
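Whatever hook ends up intercepting requests, the core of such a restriction is a host check. The following is a minimal sketch of that check using only the JDK; the class name HostFilter and the blocked domain are illustrative assumptions. When HtmlUnit's WebClient is used directly (rather than through Selenium), a subclass of WebConnectionWrapper overriding getResponse is one place where a check like this could be called, though that wiring is not shown here.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

final class HostFilter {
    private final Set<String> blockedDomains;

    HostFilter(Set<String> blockedDomains) {
        this.blockedDomains = Set.copyOf(blockedDomains);
    }

    // True when the URL's host equals a blocked domain or is a subdomain of one.
    // Throws IllegalArgumentException if the string is not a valid URI.
    boolean isBlocked(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // relative or opaque URI: nothing to match against
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : blockedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative blocklist; adjust to the domains you want to skip.
        HostFilter filter = new HostFilter(Set.of("google-analytics.com"));
        System.out.println(filter.isBlocked("https://www.google-analytics.com/analytics.js"));
        System.out.println(filter.isBlocked("https://example.com/index.html"));
    }
}
```

Matching on the dot-prefixed suffix avoids accidentally blocking unrelated hosts whose names merely end with the same characters.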

HtmlUnit ScriptException errors

I am using HtmlUnitDriver, and here is my code: HtmlUnitDriver driver = new HtmlUnitDriver(true); driver.get("some url here"); I am getting the following exception: Caused by: com.gargoylesoftware.htmlunit.ScriptException: Wrapped com.gargoylesoftware.htmlunit.ScriptException: SyntaxError: missing ; before statement (http://sales.liveperson.net/hcp/html/mTag.js?site=7824460#1(eval)#1) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:595) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537) at net.sourceforge.htmlunit

Does HtmlUnit load images when it browses page?

Question: As above. Does it load images? Answer 1: By default: no. You have to use htmlImage.getImageReader(), or you can use htmlPage.save(). Update: as of 2.25, you can use: webClient.getOptions().setDownloadImages(true); Source: https://stackoverflow.com/questions/3425697/does-htmlunit-load-images-when-it-browses-page

.switchTo().frame(<'frameId'>); not working with HtmlUnit Driver

I am kind of new to HtmlUnit and am having some trouble getting a "Setup" menu item which is situated in a frame. The code below works perfectly fine with the Firefox driver but fails with HtmlUnitDriver: HtmlUnitDriver driver = new HtmlUnitDriver(); driver.get(fleetWorkURL); WebElement usernameElement = driver.findElement(By.name("j_username")); usernameElement.sendKeys(username); WebElement passwordElement = driver.findElement(By.name("j_password")); passwordElement.sendKeys(password); WebElement logInButton = driver.findElement(By.className("button_acunia")); logInButton.click(); driver.switchTo()

HtmlUnit WebClient Timeout

In my previous questions about HtmlUnit, "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck", I mentioned that a URL was getting stuck. I also found out that it gets stuck because one of the methods (parse) in the HtmlUnit library never returns. I did further work on this and wrote code to abandon the method call if it takes longer than a specified timeout (in seconds) to complete. import java.io.IOException; import java.net.MalformedURLException; import java.util.Date; import java.util.concurrent.ExecutorService; import java.util
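The asker's code is cut off above, so the following is not a reconstruction of it. It is a minimal, hedged sketch of the general technique the paragraph describes: run the possibly hanging call on a worker thread through an ExecutorService and give up waiting after a timeout. The class name TimedCall and the placeholder tasks are illustrative assumptions; in practice the Callable would wrap the page fetch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    // Runs the task on a worker thread and waits at most timeoutMillis for its result.
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "timed-call");
            worker.setDaemon(true); // a stuck task must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only stops tasks that respond to interruption.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder tasks standing in for a page fetch.
        System.out.println(callWithTimeout(() -> "page source", 1000));
        try {
            callWithTimeout(() -> {
                Thread.sleep(5000); // simulates a call that hangs
                return "never returned";
            }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

The caller regains control after the timeout, but a task that ignores interruption keeps running in the background; the daemon thread only guarantees that it cannot block JVM shutdown.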

HtmlUnit Android problem with WebClient

HtmlUnit is amazing; in Java, at least, I have had no problems with it. Unfortunately, after moving the code over to the Android platform, I get errors when I try to create a WebClient. import android.app.Activity; import android.os.Bundle; import com.gargoylesoftware.htmlunit.WebClient; public class AndroidTestActivity extends Activity { /** Called when the activity is first created. */ @Override public void onCreate(Bundle savedInstanceState) { super.onCreate(savedInstanceState); setContentView(R.layout.main); final WebClient webClient = new WebClient(); } } Warnings it gives me

Running HtmlUnit with Jython - issue with startup on command line

I tried to run HtmlUnit with Jython following this tutorial: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/ but it does not work for me. I am unable to import the com.gargoylesoftware packages; there are only some HTML files in the HtmlUnit folder. Do I need to import those somehow? The tutorial says to run the Python script like this: /opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py and I try to run: java -jar /Users/adam/jython/jython.jar -J-classpath "htmlunit-2.8/lib/*" gartner.py My problem is that I am getting an "Unknown option: J-classpath". But there is not even