jsoup

jsoup times out, xml gets white space error, basic traversing through page is time consuming

五迷三道 submitted on 2020-01-16 08:40:48
Question: I would like to write a program that parses an HTML page, selects the useful information, and displays it. I first did this by opening a stream and searching line by line for the relevant content, but this is a time-consuming process. I then decided to treat the page as XML and use XPath. I created an XML file on my system and loaded the contents from the stream, but I got a white space error. Then I decided to open the document directly as doc = (Document) builder.parse

Extract loosely structured Wikipedia text. HTML

亡梦爱人 submitted on 2020-01-16 04:26:28
Question: Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous: the links there that point to specific persons named Corzine are difficult to capture using jsoup because they're not explicitly structured, nor do they live in a particular section as in this example. See the Corzine page here. How can I get hold of them? Is jsoup a suitable tool for this task? Perhaps I should use regex, but I fear doing that because I want it to be generalizable. </b> may refer to

jsoup - stop jsoup from making quotes into &quot;

若如初见. submitted on 2020-01-16 04:12:36
Question: When I parse local HTML files, jsoup changes quotes inside an anchor element to &quot;, obscuring my HTML. Let's assume I want to change the value "one" to "two" in the following HTML: <div class="pg2-txt1"> <a class="foo" appareantly_a_javascript_statement='{"targetId":"pg1-magn1", "ordinal":1}'>one</a> </div> What I get is: <div class="pg2-txt1"> <a class="foo" appareantly_a_javascript_statement="{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}">two</a> </div> The quotes inside the anchor element are

jsoup to log in to a website

二次信任 submitted on 2020-01-16 01:19:12
Question: I am trying to use jsoup to get information after logging in to "http://pawscas.usask.ca/cas-web/login". I've tried the code below and it doesn't seem to work. Any help would be appreciated, thanks. Connection.Response res = null; try { res = Jsoup.connect("http://pawscas.usask.ca/cas-web/login") .data("username", "user") .data("password", "pass") //.data("It", "some data") //.data("execution", "some data") //.data("_eventId", "submit") .method(Method.POST) .execute(); } catch (IOException e) {

Date format gets disturbed when creating a .CSV file in Java

偶尔善良 submitted on 2020-01-16 01:10:30
Question: I am creating a web scraper that stores its data in a .CSV file. The program runs fine, but the website I am retrieving data from shows dates in (Month Day, Year) format. When I save the data to the .CSV file, the comma makes the year land in a separate column, which shifts all the remaining data. I actually want to store the date in (MM-MON-YYYY) format and keep the validity date in one column. I am posting my code below. Kindly help me out. Thanks! P
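The question's own code is cut off in this excerpt, so the following is only a minimal sketch of two common ways to keep a "Month Day, Year" value in a single CSV column: quote the field, or reformat the date into a pattern without a comma. The class name, the method names, the input pattern "MMMM d, yyyy" and the output pattern "dd-MMM-yyyy" are all assumptions made for illustration and are not taken from the original post.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class CsvDateExample {
    // Assumed pattern of the scraped text, e.g. "January 16, 2020".
    private static final DateTimeFormatter SCRAPED =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);

    // Assumed comma-free target pattern, e.g. day-month abbreviation-year.
    private static final DateTimeFormatter COMMA_FREE =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy", Locale.ENGLISH);

    // Option 1: wrap the field in double quotes (doubling any embedded quotes)
    // so that a CSV reader treats the comma as part of the value.
    public static String quoteCsvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Option 2: parse the scraped date and write it back without a comma.
    public static String reformatDate(String scraped) {
        return LocalDate.parse(scraped.trim(), SCRAPED).format(COMMA_FREE);
    }

    public static void main(String[] args) {
        String scraped = "January 16, 2020";
        // Each printed line is one CSV row with exactly two columns.
        System.out.println(String.join(",", "item-1", quoteCsvField(scraped)));
        System.out.println(String.join(",", "item-1", reformatDate(scraped)));
    }
}
```

Quoting keeps the original text untouched, while reformatting gives a value that is easier to sort or re-parse later; which one fits depends on how the .CSV file is consumed.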

400 Http Errors Using Jsoup in Multithreaded Program

匆匆过客 submitted on 2020-01-16 01:03:46
Question: I've created a program that parses HTML pages. I use the jsoup connect function within a Callable class inside a thread pool. The problem is that I'm connecting to the same website, and with a thread pool size of 5 or more I get IOExceptions with 400 errors. How do I stop that from happening? Answer 1: If you're getting a 400 HTTP response, check the content of the response for an error message. A 400 means a bad request of some kind: you didn't include all the required information, or you included malformed information.

Jsoup behavior when any HTML end tag is missing

淺唱寂寞╮ submitted on 2020-01-15 06:57:05
Question: What is the default behavior of Jsoup when an HTML tag (either the start tag or the end tag) is missing? Will it throw an error, ignore the existing tag, or remove the existing tag? Answer 1: When the end tag is missing, Jsoup will do its best to add it at the most sensible place in accordance with the HTML5 spec. When the start tag is missing, Jsoup will remove the end tag. Source: https://stackoverflow.com/questions/6931799/jsoup-behavior-when-any-html-end-tag-is-missing
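The behavior described in the answer can be checked with a small experiment. This is only a sketch: it assumes the jsoup library is on the classpath, and the class name, method names and input strings are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MissingTagExample {
    // Two <p> start tags and no </p> end tags: the parser closes each
    // paragraph implicitly instead of throwing an error.
    public static Document parseMissingEndTags() {
        return Jsoup.parse("<div><p>one<p>two</div>");
    }

    // A </span> end tag with no matching start tag: the stray tag is dropped.
    public static Document parseStrayEndTag() {
        return Jsoup.parse("<p>text</span></p>");
    }

    public static void main(String[] args) {
        // Print the normalized markup to see where the end tags were added.
        System.out.println(parseMissingEndTags().body().html());
        // Print the normalized markup to see that no span element exists.
        System.out.println(parseStrayEndTag().body().html());
    }
}
```

Printing the parsed body is a quick way to see how jsoup repaired the markup before writing selectors against it.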

"Exception in thread "main" java.lang.NullPointerException" error when running web scraper program

北战南征 submitted on 2020-01-15 06:22:11
Question: I'm fairly new to web scraping and have limited knowledge of Java. Every time I run this code, I get the error: Exception in thread "main" java.lang.NullPointerException at sws.SWS.scrapeTopic(SWS.java:38) at sws.SWS.main(SWS.java:26) Java Result: 1 BUILD SUCCESSFUL (total time: 0 seconds) My code is: import java.io.*; import java.net.*; import org.jsoup.Jsoup; import org.jsoup.nodes.Document; public class SWS { /** * @param args the command line arguments */ public static void main(String[]

Jsoup (Find Element)

徘徊边缘 submitted on 2020-01-14 05:32:06
Question: Please help me solve this problem. I need to pull some data from Wikipedia; I'll show it in the picture below: In the page code, the data is here: How can I get this data using jsoup? I tried it like this: System.out.println(doc.select("div.mw-body-content > p ").first().text()); But the problem is that sometimes this paragraph is not the first one in the code but, for some reason, the second: Answer 1: Get the parent div by its ID (which should be unique): Elements parent = doc