htmlunit

Xpath to get the 2nd url with the matching text in the href tag

泄露秘密 submitted on 2021-02-19 02:41:06
Question: An HTML page has paging links: one set at the top of the page and another at the bottom. Using HtmlUnit, I currently get the HtmlAnchor on the page with getByAnchorText("1"); Some of the links at the top have a problem, so I want to reference the bottom links using XPath. nextPageAnchor = (HtmlAnchor) page.getByXPath(""); How can I reference the 2nd link on the page using XPath? I need to reference the link by its anchor text, so a link like: <a href="....">33<
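One way to approach the question above is an XPath that collects every anchor with the wanted text and then takes the second match in document order. The sketch below only demonstrates the expression itself, using the JDK's built-in javax.xml.xpath on a small well-formed sample page; the class name, method names, and sample markup are hypothetical and not from the original post. With HtmlUnit, the same expression string would presumably be passed to page.getByXPath(...) and the matching element cast to HtmlAnchor.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
class SecondLinkXPath {

    // Builds an expression for the second <a> in document order whose text is linkText.
    // The outer parentheses matter: //a[...][2] would instead pick anchors that are the
    // second match among their own siblings. Assumes linkText contains no single quote.
    static String secondLinkExpression(String linkText) {
        return "(//a[normalize-space(.)='" + linkText + "'])[2]";
    }

    // Returns the href of the second matching link, or null if there are fewer than two.
    static String hrefOfSecondLink(String xhtml, String linkText) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Element anchor = (Element) xpath.evaluate(
                    secondLinkExpression(linkText),
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODE);
            return anchor == null ? null : anchor.getAttribute("href");
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Sample page: the same paging links appear at the top and at the bottom.
        String page = "<html><body>"
                + "<div><a href='top?p=33'>33</a><a href='top?p=34'>34</a></div>"
                + "<div><a href='bottom?p=33'>33</a><a href='bottom?p=34'>34</a></div>"
                + "</body></html>";
        System.out.println(hrefOfSecondLink(page, "33"));
    }
}
```

For the sample page, the call in main selects the bottom link for page 33 rather than the top one.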

Java program to download images from a website and display the file sizes

北城以北 submitted on 2021-02-09 11:12:15
Question: I'm creating a Java program that will read an HTML document from a URL and display the sizes of the images in the code. I'm not sure how to go about achieving this, though. I don't need to actually download and save the images; I just need the sizes and the order in which they appear on the webpage. For example, a webpage has 3 images: <img src="dog.jpg" /> //which is 54kb <img src="cat.jpg" /> //which is 75kb <img src="horse.jpg"/> //which is 80kb I would need the output of my java program
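A rough sketch of one possible approach, using only the JDK: collect the src attributes in page order, then send an HTTP HEAD request per image and read the Content-Length header, so the image bodies are never downloaded. The class and method names are hypothetical, the regular expression is a simplification (a real HTML parser is more robust), and this assumes the server answers HEAD requests with a Content-Length; otherwise the reported size is -1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ImageSizes {

    // Simplified pattern: captures the quoted src value of each <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values in the order the images appear in the HTML.
    static List<String> imageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher matcher = IMG_SRC.matcher(html);
        while (matcher.find()) {
            sources.add(matcher.group(1));
        }
        return sources;
    }

    // Asks the server for headers only; returns -1 if no Content-Length is sent.
    static long sizeInBytes(URL imageUrl) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) imageUrl.openConnection();
        connection.setRequestMethod("HEAD");
        try {
            return connection.getContentLengthLong();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ImageSizes <page-url>");
            return;
        }
        URI pageUri = URI.create(args[0]);
        String html;
        try (InputStream in = pageUri.toURL().openStream()) {
            html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        for (String src : imageSources(html)) {
            // Resolve relative paths such as "dog.jpg" against the page URL.
            URL imageUrl = pageUri.resolve(src).toURL();
            System.out.println(src + " " + sizeInBytes(imageUrl) + " bytes");
        }
    }
}
```

The sizes come back in bytes, so dividing by 1024 would give the kilobyte figures mentioned in the question.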

Prevent HtmlUnit 2.13 from executing JavaScript

心不动则不痛 submitted on 2021-02-07 05:07:43
Question: Here is my code to get the page: WebClient webClient = new WebClient(); HtmlPage page = webClient.getPage(url); The problem is that the webClient always executes JavaScript automatically and throws a list of errors at me. I just want to get the raw source. How can I prevent it from executing scripts? I've found there is a way in version 2.9: webClient.setJavaScriptEnabled(false); But the setJavaScriptEnabled() function was deprecated. Does anyone know how to solve this problem? Please help me. Thank you so

HtmlUnit can't retrieve page after downloading a file

与世无争的帅哥 submitted on 2021-02-06 12:52:20
Question: I'm having a weird problem with HtmlUnit in Java. I am using it to download some data from a website. The process is something like this:
1 - Log in
2 - For each element (cars)
----- 3 - Search for the car
----- 4 - Download a zip file from a link
The code: Creation of the WebClient: webClient = new WebClient(BrowserVersion.FIREFOX_3_6); webClient.setJavaScriptEnabled(true); webClient.setThrowExceptionOnScriptError(false); DefaultCredentialsProvider provider = new DefaultCredentialsProvider();

set up python with htmlunit driver

久未见 submitted on 2021-01-07 04:56:37
Question: I am trying to set up Python with the Selenium HtmlUnit driver on Ubuntu 18.04. Launching the Selenium standalone server:
java -jar selenium-server-standalone-3.141.59.jar
Output:
00:00:46.177 INFO [GridLauncherV3.parse] - Selenium server version: 3.141.59, revision:e82be7d358
00:00:46.299 INFO [GridLauncherV3.lambda$buildLaunchers$3] - Launching a standalone Selenium Server on port 4444
2020-05-25 00:00:46.366:INFO::main: Logging initialized @483ms to org.seleniumhq.jetty9.util.log.StdErrLog
00:00:46

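Project: Visualization Analysis (backend data-crawling part)

岁酱吖の submitted on 2020-08-14 11:43:50
1: Project introduction
The visualization analysis project loads the full content of the Three Hundred Tang Poems into MySQL, then implements a simple front-end page that presents the data as charts, so that users can see at a glance how many poems each author wrote, along with a word cloud built from the most frequently used words.
2: Project design
The project has two main parts:
- a backend part that crawls the Tang poem data and stores it in the database
- a part that reads the database and renders the charts in a front-end web page
The data we need to crawl comes from:
Reason: the Three Hundred Tang Poems website is free and public.
Question to consider: how do we store each poem's title, dynasty, author, and body text in MySQL?
1. Fetch the HTML document of the list page and use methods from the third-party HtmlUnit library to get the URL of each poem.
2. Analyze the detail page and use XPath to get each poem's title, dynasty, author, and body text.
3. Use the SHA-256 algorithm from Java's built-in MessageDigest class (a hashing class) to prevent the same data from being inserted into the database twice.
4. Use parse() from the NlpAnalysis class in the third-party ansj-seg library to segment the text into words, in preparation for the word cloud on the front-end page.
5. Use JDBC to store the data in the database.
Preliminary research and technology selection:
HtmlUnit (web crawling)
The third-party HtmlUnit library ships with its own HTTP client, which lets us access server resources and handles both requesting and parsing HTML pages. The library provides methods such as getElementsByAttribute(), getAttribute(), getElementsByATagName(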
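Step 3 of the plan above (using SHA-256 from MessageDigest to avoid storing the same poem twice) might look roughly like the following. This is a minimal sketch under assumptions: the class name, the choice of hashing title plus body, and the in-memory Set are all hypothetical stand-ins. In the actual project the fingerprint would more likely be stored in a database column and checked there, which the excerpt does not show.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
class PoemFingerprint {

    // Returns the SHA-256 digest of the text as 64 lowercase hex characters.
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Assumption: a poem is identified by its title and body text together.
    static String fingerprint(String title, String content) {
        return sha256Hex(title + "\n" + content);
    }

    public static void main(String[] args) {
        // Stand-in for a uniqueness check in the database.
        Set<String> seen = new HashSet<>();
        String[][] crawled = {
            {"静夜思", "床前明月光"},
            {"春晓", "春眠不觉晓"},
            {"静夜思", "床前明月光"},  // crawled twice
        };
        for (String[] poem : crawled) {
            boolean isNew = seen.add(fingerprint(poem[0], poem[1]));
            System.out.println(poem[0] + (isNew ? " stored" : " skipped (duplicate)"));
        }
    }
}
```

Hashing gives a fixed-length key regardless of poem length, which is convenient for a uniqueness check; note that MessageDigest computes a one-way hash and does not encrypt anything.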